United States Patent [19]
Hutchins

US005384893A

[11] Patent Number: 5,384,893
[45] Date of Patent: Jan. 24, 1995



[54] METHOD AND APPARATUS FOR SPEECH 
SYNTHESIS BASED ON PROSODIC 
ANALYSIS 

[75] Inventor: Sandra E. Hutchins, Del Mar, Calif. 

[73] Assignee: Emerson & Stern Associates, Inc., 
San Diego, Calif. 

[21] Appl. No.: 949,208 

[22] Filed: Sep. 23, 1992 

[51] Int. Cl.6 G10L 9/00

[52] U.S. Cl. 395/2.76; 395/2.67

[58] Field of Search 381/51-53; 395/2.67-2.78

[56] References Cited 

U.S. PATENT DOCUMENTS 

3,704,345 11/1972 Coker et al. 179/15 A
4,214,125 7/1980 Mozer et al. 179/1 SM
4,314,105 2/1982 Mozer 179/15.55 R
4,384,170 5/1983 Mozer et al. 179/1 SM
4,433,434 2/1984 Mozer 381/30
4,435,831 3/1984 Mozer 381/30
4,458,110 7/1984 Mozer 381/32
4,624,012 11/1986 Lin et al. 381/51
4,685,135 8/1987 Lin et al. 381/52
4,692,941 9/1987 Jacks et al. 381/52
4,695,962 9/1987 Goudie 364/513.5
4,797,930 1/1989 Goudie 381/52
4,831,654 5/1989 Dick 381/51
4,833,718 5/1989 Sprague 381/52
4,852,168 7/1989 Sprague 381/35
4,872,202 10/1989 Fette 381/52
4,896,359 1/1990 Yamamoto et al. 381/52
4,907,279 3/1990 Higuchi et al. 381/52
4,912,768 3/1990 Benbassat 381/52
4,964,167 10/1990 Kunizawa et al. 381/52
4,975,957 12/1990 Ichikawa et al. 381/36

OTHER PUBLICATIONS 

D. Klatt, "Software for a Cascade/Parallel Formant Synthesizer", J. Acoust. Soc. of Amer., vol. 67, pp. 971-994 (Mar. 1980).

D. Malah, "Time-Domain Algorithms for Harmonic Bandwidth Reduction and Time Scaling of Speech Signals", IEEE Trans. on Acoustics, Speech and Signal Processing, vol. ASSP-27, pp. 121-133 (Apr. 1979).

F. Lee, "Time Compression and Expansion of Speech by the Sampling Method", J. Audio Eng'g Soc., vol. 20, pp. 738-742 (Nov. 1972).

T. Sakai et al., "On-Line, Real-Time, Multiple-Speech Output System", Proc. Int'l Fed. for Info. Processing Cong. Booklet TA-4, Ljubljana, Yugoslavia (Aug. 1971) pp. 3-7.

T. Tremain, "The Government Standard Linear Predictive Coding Algorithm: LPC-10", Speech Technology, vol. 1, No. 2, pp. 40-49 (Apr. 1982).

Primary Examiner: Allen R. MacDonald

Assistant Examiner: Michelle Doerrler

Attorney, Agent, or Firm: Burns, Doane, Swecker & Mathis



[57] ABSTRACT



A system for synthesizing a speech signal from strings of words, which are themselves strings of characters, includes a memory in which predetermined syntax tags are stored in association with entered words and phonetic transcriptions are stored in association with the syntax tags. A parser accesses the memory and groups the syntax tags of the entered words into phrases according to a first set of predetermined grammatical rules relating the syntax tags to one another. The parser also verifies the conformance of sequences of the phrases to a second set of predetermined grammatical rules relating the phrases to one another. The system retrieves the phonetic transcriptions associated with the syntax tags that were grouped into phrases conforming to the second set of rules, and also translates predetermined strings of characters into words. The system generates strings of phonetic transcriptions and prosody markers corresponding to respective strings of the words, and adds markers for rhythm and stress to the strings, which are then converted into data arrays having prosody information on a diphone-by-diphone basis. Predetermined diphone waveforms are retrieved from memory that correspond to the entered words, and these retrieved waveforms are adjusted based on the prosody information in the arrays. The adjusted diphone waveforms, which may also be adjusted for coarticulation, are then concatenated to form the speech signal. Methods in a digital computer are also disclosed.

11 Claims, 11 Drawing Sheets 











02/23/2004, EAST version: 1.4.1 



U.S. Patent Jan. 24, 1995 Sheet 1 of 11 5,384,893



[Block diagram of TTS system 1001. Blocks and signals, with the reference numerals used in the description:
 Text Input 1005
 Dictionary Look-up Module 1010
 Word Dictionary with Phonetic Transcriptions 1020
 Syntactic Info & Transcriptions 1030
 Grammar Look-up Modules 1040
 Grammar Tables
 Sentence Buffer 1060
 Grammar & Synth Logs
 Phonetics Extractor 1080
 Phonetic String with Prosody Markers 1090
 Prosody Generator 1100
 Diphone & Prosody Arrays 1110
 Waveform Generator 1120
 Diphone Waveforms 1130
 Speech Output
 The numeral 1070 also appears in the figure.]



Figure 1 






U.S. Patent Jan. 24, 1995 Sheet 2 of 11 5,384,893 






[Record layout for one tag of a word. Fields: Tag (TAG N) | Skip | Pointer to Phonetic Transcription | Location in Text.
 High bit ON: end tags for this word
 High bit OFF: more tags follow]



Figure 2. 
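
The record layout of Figure 2 can be read with a short loop. The following Python sketch is illustrative only. The field widths (one byte each for the tag and the skip count, two bytes each for the transcription pointer and the text location), the placement of the end-of-tags flag in the high bit of the skip byte, and the names `TagRecord`, `RECORD` and `read_tag_records` are all assumptions made for the example; the figure itself gives only the field names and the meaning of the high bit.

```python
import struct
from dataclasses import dataclass


@dataclass
class TagRecord:
    tag: int        # syntax tag code (for example, a code standing for NCOM or VB)
    skip: int       # skip count with the flag bit masked off
    trans_ptr: int  # pointer to the phonetic transcription
    loc: int        # location of the word in the input text


# Assumed widths: tag (1 byte), skip (1 byte), pointer (2 bytes), location (2 bytes).
RECORD = struct.Struct("<BBHH")


def read_tag_records(buf, offset=0):
    """Read the tag records of one word, stopping at the record whose high bit is ON."""
    records = []
    while True:
        tag, skip, trans_ptr, loc = RECORD.unpack_from(buf, offset)
        records.append(TagRecord(tag, skip & 0x7F, trans_ptr, loc))
        offset += RECORD.size
        if skip & 0x80:  # high bit ON: end of tags for this word
            return records, offset
        # high bit OFF: more tags follow, so keep reading


# Two tags for one word: the second record carries the end-of-tags flag.
data = RECORD.pack(1, 0x02, 100, 7) + RECORD.pack(2, 0x80 | 0x02, 140, 7)
records, next_offset = read_tag_records(data)
```

Keeping one transcription pointer per tag, rather than one per word, is what lets a word such as "record" carry a different pronunciation for its noun tag and its verb tag.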



[Structure of a path: transcription pointers Trans1, Trans2, Trans3, ..., TransN paired with locations Loc1, Loc2, Loc3, ..., LocN, followed by a field "Number of Entries So Far".]



Figure 4. 






U.S. Patent Jan. 24, 1995 Sheet 3 of 11 5,384,893 



[Diagram keying dictionary records to the input text. Labels recoverable from the scan:
 INPUT TEXT
 Record 1: NCOM | Skip (High bit OFF) | TransPtr | Loc'n
 Record 2: VB | Skip (High bit ON) | TransPtr | Loc'n
 Syntax Info and Transcriptions
 Segment of Dictionary]



Figure 3. 






U.S. Patent Jan. 24, 1995 Sheet 4 of 11 5,384,893 



[Buffer layout: entries of the form TransPtr | Loc | SI, shown for entry 1 and entry 2.
 Syntax_Info Byte (SI)
 = 1 for End Sentence
 = 0 for Parse Failure Mid-Sentence]



Figure 5. 



Index:                 0  1  2  3  4  5  6  ...  N-1

Diphone Number     = DN
Lexical Stress     = LS
Syntactic Stress   = SS
Syntactic Duration = SD
Total Stress       = TS
Amplitude          = AM
Duration Factor    = DF
First Pitch        = P1
Second Pitch       = P2



Figure 6. 
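
The arrays of Figure 6 are parallel: entry k of every array describes the same diphone. A minimal Python sketch of such a structure follows. The class name `ProsodyArrays`, its methods, and the choice of 0 as the value for a field not yet computed are assumptions made for illustration; only the nine array names come from the figure.

```python
from dataclasses import dataclass, field

# Array names as given in Figure 6.
FIELDS = ("DN", "LS", "SS", "SD", "TS", "AM", "DF", "P1", "P2")


@dataclass
class ProsodyArrays:
    DN: list = field(default_factory=list)  # diphone number
    LS: list = field(default_factory=list)  # lexical stress
    SS: list = field(default_factory=list)  # syntactic stress
    SD: list = field(default_factory=list)  # syntactic duration
    TS: list = field(default_factory=list)  # total stress
    AM: list = field(default_factory=list)  # amplitude
    DF: list = field(default_factory=list)  # duration factor
    P1: list = field(default_factory=list)  # first pitch
    P2: list = field(default_factory=list)  # second pitch

    def append(self, **values):
        """Add one diphone; fields not supplied are filled with 0 (an assumption)."""
        for name in FIELDS:
            getattr(self, name).append(values.get(name, 0))

    def __len__(self):
        return len(self.DN)

    def row(self, k):
        """Return all prosody values for diphone index k as a dict."""
        return {name: getattr(self, name)[k] for name in FIELDS}


arrays = ProsodyArrays()
arrays.append(DN=12, LS=1)
arrays.append(DN=40, LS=2, P1=110)
```

Appending through a single method keeps all nine arrays the same length, so a later stage can walk them with one index.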






U.S. Patent Jan. 24, 1995 Sheet 5 of 11 5,384,893

[Flowchart for generating the prosody arrays from the phonetic string pstr. Box and decision labels recoverable from the scan; branch arrows and some marker characters are illegible:
 Decisions: n >= len(pstr)?; pstr[n] is a digit?; four tests of pstr[n] against bracket and brace characters (one reads '{'); DiphoneNumber(pstr, n) >= 0?
 Actions: LStr = value(pstr[n]); SStr = SStr - 1; SStr = SStr + 1; SDur = SDur + 1 (partly illegible); SDur = SDur - 1
 Array updates: DN[j] = DiphoneNumber(pstr, n); LS[j] = LStr; SS[j] = SStr; SD[j] = SDur
 Other boxes: tmp = pstr[n+1]; an assignment to pstr[n+1] (value illegible); pstr[n+1] = tmp; pstr[n] = '#'; n = n+1
 Exit: pull stress forward; Done]



Figure 7a. 
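
One step of the Figure 7a flowchart, LStr = value(pstr[n]), records the stress digit that precedes a syllable so that it can be attached to the phonetic symbols that follow. The Python sketch below shows that step alone. It is a simplification under stated assumptions: the bracket and brace markers are not handled, because the scan does not show which marker drives which adjustment; the function name `lexical_stress_per_symbol` is hypothetical; and the starting value of 4 (the lowest stress level in the preferred embodiment) for symbols that precede any digit is an assumption. The example string follows the dictionary entry for "frequently" given in the description, read with a leading stress digit 1.

```python
DIGITS = "0123456789"


def lexical_stress_per_symbol(pstr, default=4):
    """Pair each phonetic symbol with the most recent stress digit before it."""
    lstr = default           # assumed starting level before any digit is seen
    result = []
    for ch in pstr:
        if ch in DIGITS:
            lstr = int(ch)   # corresponds to LStr = value(pstr[n])
        else:
            result.append((ch, lstr))
    return result


pairs = lexical_stress_per_symbol("1fRi2kwYnt2Li")
```

In the flowchart the stored value is written into LS[j] for each diphone rather than for each symbol; the per-symbol pairing here is only meant to show how a digit carries forward until the next digit.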






U.S. Patent Jan. 24, 1995 Sheet 6 of 11 5,384,893 




[Flowchart of DiphoneNumber(pstr, n). Box and decision labels recoverable from the scan; branch arrows are illegible:
 Decision: stdip has been made?
 Decision: pstr[j] is a digit or a bracket or a brace?
 c[k] = pstr[j]; k = k+1
 Decision: k > 1?
 start = stdip[c[0] & 0x7f]; j = start
 Decision: start < 0?
 Decisions: c[0] = DP[j].name[0]?; c[1] = DP[j].name[1]?
 j = j+1
 Decision: j > number of diphones?
 return (value illegible)]



Figure 7b. 






U.S. Patent Jan. 24, 1995 Sheet 7 of 11 5,384,893 




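[Flowchart "Make stdip". Box and decision labels recoverable from the scan; branch arrows are illegible:
 Decision: stdip[DP[j].name[0] & 0x7f] = -1?
 stdip[DP[j].name[0] & 0x7f] = j
 j = j + 1
 Decision: j > total number of diphones?
 Done]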




Figure 7c. 
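
Figures 7b and 7c together describe an indexed lookup: a table stdip records, for each starting character, where diphones beginning with that character start in the diphone list, and DiphoneNumber scans forward from that point. The Python sketch below is one possible reading of the two flowcharts, not a transcription of them. It assumes that each diphone name is a two-character string, that diphones sharing a first character are stored contiguously, and that a failed lookup returns -1 (suggested by the test DiphoneNumber(pstr, n) >= 0 in Figure 7a). The function names and the sample names are hypothetical.

```python
def make_stdip(names):
    """Build stdip: for each 7-bit character code, the index of the first
    diphone whose name starts with that character, or -1 if there is none."""
    stdip = [-1] * 128
    for j, name in enumerate(names):
        key = ord(name[0]) & 0x7F
        if stdip[key] == -1:   # only the first occurrence is recorded
            stdip[key] = j
    return stdip


def diphone_number(names, stdip, pair):
    """Return the index of the diphone named by the two characters in pair, or -1."""
    start = stdip[ord(pair[0]) & 0x7F]
    if start < 0:
        return -1
    j = start
    # Scan only the run of names that share the first character (assumed contiguous).
    while j < len(names) and names[j][0] == pair[0]:
        if names[j][1] == pair[1]:
            return j
        j += 1
    return -1


names = ["#k", "#t", "ka", "kt", "ta", "t#"]   # sample names, made up for the example
stdip = make_stdip(names)
index = diphone_number(names, stdip, "kt")
```

The table trades 128 small entries for a much shorter scan: a lookup touches only the names that share the first character instead of the whole diphone list.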






U.S. Patent Jan. 24, 1995 Sheet 8 of 11 5,384,893 




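[Flowchart of the pull-stress-forward module. Box and decision labels recoverable from the scan; the decision conditions are partly illegible:
 Start; k = 0
 Decision involving DP[DN[k]].name[0]
 LS[k] = LS[k+1]; SS[k] = SS[k+1]; SD[k] = SD[k+1]
 k = k+1
 Done]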



Figure 7d. 






U.S. Patent Jan. 24, 1995 Sheet 9 of 11 5,384,893 



[Pitch contour plot. Pitch levels marked: 110% Reference Pitch, Reference Pitch, 98% Reference Pitch, 90% Reference Pitch. Time points marked: Start Sentence, End First Word, Start Last Word, End Sentence.]




Figure 8. 






U.S. Patent Jan. 24, 1995 Sheet 10 of 11 5,384,893 




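[Flowchart of the coarticulation process. Box and decision labels recoverable from the scan; branch arrows and several quoted characters are illegible:
 START; k = 0; j = 0; DONE
 ph = FirstChar[DN[j]] + SecondChar[DN[j+1]]; new = DiphoneNumber(ph, 0)
 Decision: new >= 0 and j < N-1?
 Decision: a test on FirstChar[DN[j]] (against '#') and SecondChar[DN[j+1]] (character illegible)
 Decision: a test on SecondChar[DN[j]] and FirstChar[DN[j+1]] (characters illegible)
 Decision: FirstChar[DN[j]] is Voiced and SecondChar[DN[j+1]] is Voiced?
 Decision: FirstChar[DN[j]] is Voiced?
 Assignments using the new diphone: DN[k] = new; AM[k] = (AM[j] + AM[j+1])/2; DF[k] = (DF[j] + DF[j+1])/2; P2[k] = P2[j+1]
 Assignments copying diphone j: DN[k] = DN[j]; AM[k] = AM[j]; DF[k] = DF[j]; P1[k] = P1[j]; P2[k] = P2[j]
 Pitch assignments: P1[k] = P1[j+1]; P1[k] = (P1[j] + P1[j+1])/2; P1[k] = P1[j]
 Index updates: j = j+1; k = k+1; j = j+1]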



Figure 9. 






U.S. Patent Jan. 24, 1995 Sheet 11 of 11 5,384,893



[Waveform sketches. Labels recoverable from the scan, in the order they appear:
 original signal in SAMP, divided into marked intervals (marked interval in original, three times)
 Figure 10A.
 pad with zeroes (twice); marked interval in original; desired interval length; raw signal with lower pitch
 Figure 10B.
 region of summation; marked interval in original; first interval to overlap; second interval to overlap
 Figure 10C.
 desired interval length; marked interval in original
 Figure 10D.
 raw signal with higher pitch
 Figure 10E.]






METHOD AND APPARATUS FOR SPEECH SYNTHESIS BASED ON PROSODIC ANALYSIS

BACKGROUND

The present invention relates to methods and apparatus for synthesizing speech from text.

A wide variety of electronic systems that convert text to speech sounds are known in the art. Usually the text is supplied in an electrical digitally coded format, such as ASCII, but in principle it does not matter how the text is initially presented. Every text-to-speech (TTS) system, however, must convert the input text to a phonetic representation, or pronunciation, that is then converted into sound. Thus, a TTS system can be characterized as a transducer between representations of the text. Much effort has been expended to make the output of TTS systems sound "more natural", viz., more like speech from a human and less like sound from a machine.

A very simple system might use merely a fixed dictionary of word-to-phonetic entries. Such a dictionary would have to be very large in order to handle a sufficiently large number of words, and a high-speed processor would be necessary to locate and retrieve entries from the dictionary with sufficiently high speed.

To help avoid such drawbacks, other systems, such as that described in U.S. Pat. No. 4,685,135 to Lin et al., use a set of rules for conversion of words to phonetics. In the Lin system, phonetics-to-sound conversion is accomplished with allophones and linear predictive coding (LPC), and stress marks must be added by hand in the input text stream. Unfortunately, a system using a simplistic set of rules for converting words to phonetic representations will inevitably produce erroneous pronunciations for some words because many languages, including English, have no simple relationship between orthography and pronunciation. For example, the orthography, or spelling, of the English words "tough", "though", and "through" bears little relation to their pronunciation.

Accordingly, some systems, such as that described in U.S. Pat. No. 4,692,941 to Jacks et al., convert orthography to phonemes by first examining a keyword dictionary (giving pronouns, articles, etc.) to determine basic sentence structure, then checking an exception dictionary for common words that fail to follow the rules, and then reverting to the rules for words not found in the exception dictionary. In the system described in the Jacks et al. patent, the phonemes are converted to sound using a time-domain technique that permits manipulation of pitch. The patent suggests that inflection, speech and pause data can be determined from the keyword information according to standard rules of grammar, but those methods and rules are not provided, although the patent mentions a method of raising the pitch of words followed by question marks and lowering the pitch of words followed by periods.

Another prior TTS system is described in U.S. Pat. No. 4,979,216 to Malsheen et al., which uses rules for conversion to phonetics and a large exception dictionary of 3000-5000 words. The basic sound unit is the phoneme or allophone, and parameters are stored as formants.

Such systems inevitably produce erroneous pronunciations because many languages have words, or character strings, that have several pronunciations depending on the grammatical roles the strings play in the text. For example, the English strings "record" and "invalid" both have two pronunciations in phrases such as "to record a new record" and "the invalid's invalid check".

In dealing with such problems, most prior TTS systems either avoid or treat secondarily the problem of varying the stress of output syllables. A TTS system could ignore stress variations, but the result would probably be unintelligible as well as sound unnatural. Some systems, such as that described in the Lin patent cited above, require that stress markers be inserted in the text by outside means: a laborious process that defeats many of the purposes of a TTS system.

"Stress" refers to the perceived relative force with which a sound, syllable, or word is uttered, and the pattern of stresses in a sequence of words is a highly complicated function of the physical parameters of frequency, amplitude, and duration. "Orthography" refers to the system of spelling used to represent spoken language.

In contrast to the approaches of prior TTS systems, it is believed that the accuracy of stress patterns can be even more important than the accuracy of phonetics. To achieve stress pattern accuracy, however, a TTS system must take into account that stress patterns also depend on grammatical role. For example, the English character strings "address", "export", and "permit" have different stress patterns depending on whether they are used as nouns or verbs. Applicant's TTS system considers stress (and phonetic accuracy in the presence of orthographic irregularities) to be so important that it uses a large dictionary and a natural-language parser, which determines the grammatical role each word plays in the sentence and then selects the pronunciation that corresponds to that grammatical role.

It should be appreciated that Applicant's system does more than merely make an exception dictionary larger; the presence of the grammatical information in the dictionary and the use of the parser result in a system that is fundamentally different from prior TTS systems. Applicant's approach guarantees that the basic glue of English is handled correctly in lexical stress and in phonetics, even in cases that would be ambiguous without the parser. The parser also provides information on sentence structure that is important for providing the correct intonation on phrases and clauses, i.e., for extending intonation and stress beyond individual words, to produce the correct rhythm of English sentences. The parser in Applicant's system enhances the accuracy of the stress variations in the speech produced, among other reasons, because it permits identification of clause boundaries, even of embedded clauses that are not delimited by punctuation marks.

Applicant's approach is extensible to all languages having a written form in a way that rule-based text-to-phonetics converters are not. For a language like Chinese, in which the orthography bears no relation to the phonetics, this is the only option. Also for languages like Hebrew or Arabic, in which the written form is only "marginally" phonetic (due, in those two cases, to the absence of vowels in most text), the combination of dictionary and natural-language parser can resolve the ambiguities in the text and provide accurate output speech.

Applicant's approach also offers advantages for languages (e.g., Russian, Spanish, and Italian) that may be superficially amenable to rule-based conversion (i.e., where rules might "work better" than for English because the orthography corresponds more closely to the phonetics). For such languages, the combination of a dictionary and parser still provides the information on sentence structure that is critical to the production of correct intonational patterns beyond the simple word level. Also for languages having unpredictable stress (e.g., Russian, English, and German), the dictionary itself (or the combination of dictionary and parser) resolves the stress patterns in a way that a set of rules cannot.

Most prior systems do not use a full dictionary because of the memory required; the Lin et al. patent suggests that a dictionary of English words requires 600 K bytes of RAM. Applicant's dictionary with phonetic and grammatical information requires only about 175 K bytes. Also, it is often assumed that a natural-language parser of English would be too time consuming for practical systems.

This invention is an innovative approach to the problem of text-to-speech synthesis, and can be implemented using the minimal processing power available on MACINTOSH-type computers available from Apple Computer Corp. The present TTS system is flexible enough to adapt to any language, including languages such as English for which the relationship between orthography and phonetics is highly irregular. It will be appreciated that the present TTS system, which has been configured to run on Motorola M68000 and Intel 80386SX processors, can be implemented with any processor, and has increased phonetic and stress accuracy compared to other systems.

Applicant's invention incorporates a parser for a limited context-free grammar (as contrasted with finite-state grammars) that is described in Applicant's commonly assigned U.S. Pat. No. 4,994,966 for "System and Method for Natural Language Parsing by Initiating Processing prior to Entry of Complete Sentences" (hereinafter "the '966 patent"), which is hereby incorporated in this application by reference. It will be understood that the present invention is not limited in language or size of vocabulary; since only three or four bytes are needed for each word, adequate memory capacity is usually not a significant concern in current small computer systems.

SUMMARY

In one aspect, Applicant's invention provides a system for synthesizing a speech signal from strings of words, which are themselves strings of characters, entered into the system. The system includes a memory in which predetermined syntax tags are stored in association with entered words and phonetic transcriptions are stored in association with the syntax tags. A parser accesses the memory and groups the syntax tags of the entered words into phrases according to a first set of predetermined grammatical rules relating the syntax tags to one another. The parser also verifies the conformance of sequences of the phrases to a second set of predetermined grammatical rules relating the phrases to one another. The system retrieves the phonetic transcriptions associated with the syntax tags that were grouped into phrases conforming to the second set of rules, and also translates predetermined strings of characters into words. The system generates strings of phonetic transcriptions and prosody markers corresponding to respective strings of the words, and adds markers for rhythm and stress to the strings, which are then converted into data arrays having prosody information on a diphone-by-diphone basis.

Predetermined diphone waveforms are retrieved from memory that correspond to the entered words, and these retrieved waveforms are adjusted based on the prosody information in the arrays. The adjusted diphone waveforms, which may also be adjusted for coarticulation, are then concatenated to form the speech signal.

In another aspect of the invention, the system interprets punctuation marks as requiring various amounts of pausing, deduces differences between declarative, exclamatory, and interrogative word strings, and places the deduced differences in the strings of phonetic transcriptions and prosody markers. Moreover, the system can add extra pauses after highly stressed words, adjust duration before and stress following predetermined punctuation, and adjust rhythm by adding marks for more or less duration onto phonetic transcriptions corresponding to selected syllables of the entered words based on the stress pattern of the selected syllables.

The parser included in the system can verify the conformance of several parallel sequences of phrases and tag combinations derived from the retrieved syntax tags to the second set of grammatical rules, each of the parallel sequences comprising a respective one of the sequences possible for the entered words.

In another aspect, Applicant's invention provides a method for a digital computer for synthesizing a speech signal from natural language sentences, each sentence having at least one word. The method includes the steps of entering and storing a sentence in the computer, and finding syntax tags associated with the entered words in a word dictionary. Non-terminals associated with the syntax tags associated with the entered words are found in a phrase table as each word of the sentence is entered, and several possible sequences of the found non-terminals are tracked in parallel as the words are entered.

The method also includes the steps of verifying the conformance of sequences of the found non-terminals to rules associated with predetermined sequences of non-terminals, and retrieving, from the word dictionary, phonetic transcriptions associated with the syntax tags of the entered words of one of the sequences conforming to the rules. Another step of the method is generating a string of phonetic transcriptions and prosody markers corresponding to the entered words of that sequence conforming to the rules.

The method further includes the step of adding markers for rhythm and stress to the string of phonetic transcriptions and prosody markers and converting the string into arrays having prosody information on a diphone-by-diphone basis. Predetermined diphone waveforms corresponding to the string and the entered words of the sequence conforming to the rules are then adjusted based on the prosody information in the arrays. As a final step in one embodiment, the adjusted diphone waveforms are concatenated to form the speech signal.

BRIEF DESCRIPTION OF THE DRAWINGS

The features and advantages of Applicant's invention will be understood by reading the following detailed description in conjunction with the drawings in which:

FIG. 1 is a block diagram of a text-to-speech system in accordance with Applicant's invention;

FIG. 2 shows a basic format for syntactic information and transcriptions of FIG. 1;



FIG. 3 illustrates the keying of syntactic information and transcriptions to locations in the input text;

FIG. 4 shows a structure of a path in a synth_log in accordance with Applicant's invention;

FIG. 5 shows a structure for transcription pointers and locations in a synth_pointer_buffer in a TTS system in accordance with Applicant's invention;

FIG. 6 shows a structure of prosody arrays produced by a diphone-based prosody module in accordance with Applicant's invention;

FIG. 7A is a flowchart of a process for generating the prosody arrays of FIG. 6;

FIG. 7B is a flowchart of a DiphoneNumber module;

FIG. 7C is a flowchart of a process for constructing a stdip table;

FIG. 7D is a flowchart of a pull-stress-forward module;

FIG. 8 illustrates pitch variations for questions in English;

FIG. 9 is a flowchart of a coarticulation process in accordance with Applicant's invention; and

FIGS. 10A-10E illustrate speech waveform generation in accordance with Applicant's invention.

DETAILED DESCRIPTION

Applicant's invention can be readily implemented in computer program code that examines input text and a plurality of suitably constructed lookup tables. It will therefore be appreciated that the invention can be modified through changes to either or both of the program code and the lookup tables. For example, appropriately changing the lookup tables would allow the conversion of input text written in a language other than English.

OVERVIEW OF OPERATION

FIG. 1 is a high level block diagram of a TTS system 1001 in accordance with Applicant's invention. Text characters 1005, which may typically be in ASCII format, are presented at an input to the TTS system. It will be appreciated that the particular format and source of the input text does not matter; the input text might come from a keyboard, a disk, another computer program, or any other source. The output of the TTS system 1001 is a digital speech waveform that is suitable for conversion to sound by a digital-to-analog (D/A) converter and loudspeaker (not shown). Suitable D/A converters and loudspeakers are built into MACINTOSH computers and supplied on SOUNDBLASTER cards for DOS-type computers, and many others are available.

As described in Applicant's above-incorporated '966 patent, the input text characters 1005 are fed serially to the TTS system 1001. As each character is entered, it is stored in a sentence buffer 1060 and is used to advance the process in a Dictionary Look-up Module 1010, which comprises suitable program code. The Dictionary Look-up Module 1010 looks up the words of the input text in a Word Dictionary 1020 and finds their associated grammatical tags. Also stored in the Word Dictionary 1020 and retrieved by the Module 1010 are phonetic transcriptions that are associated with the tags. By associating the phonetic transcriptions, or pronunciations, with the tags rather than with the words, input words having different pronunciations for different forms, such as nouns and verbs, can be handled correctly.

An exemplary dictionary entry for the word "frequently" is the following:

frequently AVRB 1fRi2kwYnt2Li

in which "AVRB" is a grammatical tag indicating an adverb form. Each number in the succeeding phonetic transcription is a stress level for the following syllable. In a preferred embodiment of the invention, the highest stress level is assigned a value "1" and the lowest stress level is assigned a value "4", although other assignments are possible. It will be appreciated that linguists usually describe stress levels in the manner illustrated, i.e., 1 = primary, 2 = secondary, etc. As described in more detail below, an Orthography-To-Phonetics (OTP) process is a part of the Dictionary Look-up Module 1010.

In contrast to prior TTS systems, Applicant's TTS system considers stress (and phonetic accuracy in the presence of orthographic irregularities) to be so important that it uses a large dictionary and reverts to other means (such as spelling out a word or guessing at its pronunciation) only when the word is not found in the main dictionary. The English dictionary preferably contains about 12,000 roots or 55,000 words, including all inflections of each word. This ensures that about 95% of all words presented to the input will be pronounced correctly.

The Dictionary Look-up Module 1010 repetitively searches the Word Dictionary 1020 for the input string as each character is entered. When an input string terminates with a space or punctuation mark, such string is deemed to be a word, and syntactic information and transcriptions 1030 for that character string are passed to Grammar Look-up Modules 1040, which determine the grammatical role each word plays in the sentence and then select the pronunciation that corresponds to that grammatical role. This parser is described in detail in Applicant's '966 patent, somewhat modified to track the pronunciations associated with the tags.

Unlike the parser described in Applicant's '966 patent, it is not necessary for the TTS system 1001 to flag spelling or capitalization errors in the input text, or to provide help for grammatical errors. It is currently preferred that the TTS system pronounce the text as it is written, including errors, because the risk of an improper correction is greater than the cost of proceeding with errors. As described in more detail below, it is not necessary for the parsing process employed in Applicant's TTS system to parse successfully each input sentence. If errors prevent a successful parse, then the TTS system can simply pronounce the successfully parsed parts of the sentence and pronounce the remaining input text word by word.

As mentioned above, the Grammar Look-up Modules 1040 are substantially similar to those described in Applicant's '966 patent. For the TTS system, they carry along a parallel log called the "synth log", which maintains information about the phonetic transcriptions associated with the tags maintained in the path log.

A Phonetics Extractor 1080 retrieves the phonetic transcriptions for the chosen path (typically, there is only one surviving path in the path log) from the dictionary. The pronunciation information maintained in the synth log paths preferably comprises pointers to the places in the dictionary where the transcriptions reside; this is significantly more efficient than dragging around the full transcriptions, which could be done if the memory and processing resources are available.

The Phonetics Extractor 1080 also translates some text character strings, like numbers, into words. The
Phonetics Extractor 1080 interprets punctuation as requiring various amounts of pausing, and it deduces the difference between declarative sentences, exclamations, and questions, placing the deduced information at the head of the string. As described further below, the Phonetics Extractor 1080 also generates and places markers for starting and ending various types of clauses in the synth log. The string 1090 of phonetic transcriptions and prosody markers is passed to a Prosody Generator 1100.

The Prosody Generator 1100 has two major functions: manipulating the phonetics string to add markers for rhythm and stress, and converting the string into a set of arrays having prosody information on a diphone-by-diphone basis.

The term "prosody" refers to those aspects of a speech signal that have domains extending beyond individual phonemes. It is realized by variations in duration, amplitude, and pitch of the voice. Among other things, variations in prosody cause the hearer to perceive certain words or syllables as stressed. Prosody is sometimes characterized as having two parts: "intonation", which arises from pitch variations; and "rhythm", which arises from variations in duration and amplitude. "Pitch" refers to the dominant frequency of a sound perceived by the ear, and it varies with many factors such as the age, sex, and emotional state of the speaker.

Among the other terms used in this application is "phoneme", which refers to a class of phonetically similar speech sounds, or "phones", that distinguish utterances, e.g., the /p/ and /t/ phones in the words "pin" and "tin". The term "allophones" refers to the variant forms of a phoneme. For example, the aspirated /p/ of the word "pit" and the unaspirated /p/ of the word "spit" are allophones of the phoneme /p/. "Diphones" are entities that bridge phonemes, and therefore include the critical transitions between phonemes. English has about forty phonemes, about 130 allophones, and about 1500 diphones.

It can thus be appreciated that the terms "intonation",

forced pauses (e.g., pauses at clause boundaries). Then the Waveform Generator 1120 proceeds diphone by diphone through the Arrays 1110, adjusting copies of the appropriate diphone waveforms stored in a Diphone Waveform look-up table 1130 to have the pitch, amplitude, and duration specified in the Arrays 1110. Each adjusted diphone waveform is concatenated onto the end of the partial utterance until the entire sentence is completed.

It will be appreciated that the processes for synthesizing speech carried out by the Phonetics Extractor 1080, Prosody Generator 1100, and Waveform Generator 1120 depend on the results of the parsing processes carried out by the Dictionary and Grammar Modules 1010, 1040 to obtain reasonably accurate prosody. As described in Applicant's '966 patent, the parsing processes can be carried out in real time as each character in the input text is entered so that by the time the punctuation ending a sentence is entered the parsing process for that sentence is completed. As the next sentence of the input text is entered, the synthesizing processes can be carried out on the previous sentence's results. Thus, depending on parameters such as processing speed, synthesis could occur almost in real time, just one sentence behind the input. Since synthesis may not be completed before the end of the next sentence's parse, the TTS system would usually need an interrupt-driven speech output that can run as a background process to obtain quasi-real-time continuous output. Other ways of overlapping parsing and synthesizing could be used.

DETAILED DESCRIPTION OF OPERATION

The embodiment described here is for English, but it will be appreciated that this embodiment can be adapted to any written language by appropriate modifications.

DICTIONARY LOOK-UP

The term "dictionary" as used here includes not only a "main" dictionary prepared in advance, but also word lists supplied later (e.g., by the user) that specify pro-
prosody", and "stress" refer to the listener's percep- nunciations for specific words. Such supplemental word 
tion of speech rather than the physical parameters of the lists would usually comprise proper nouns not found in 
speech. the main dictionary. 

The Prosody Generator 1100 also implements a Structure of Entries 
rhythm-and-stress process that adds some extra pauses 45 Each entry in the Word Dictionary 1020 contains the 
after highly stressed words and adjusts duration before orthography for the entry, its syntactical tags, and one 
and stress following some punctuation, such as commas. or more phonetic transcriptions. The syntactical tags 
Then it adjusts the rhythm by adding marks onto sylla- listed in Table.I of the above-incorporated '966 patent 
bles for more or less duration based on the stress pattern are suitable, but are not the only ones that could be 
of the syllables. This is called "isochrony". English and 50 used. In the preferred embodiment of Applicant's TTS 
some other languages have this kind of timing in which system, those tags are augmented with two more, called 
the stressed syllables are "nearly" equidistant in time "proper noun premodifier" (NPPR) and "proper noun 
(such languages may be called "stress timed"). In con- post-modifier" (NPPO), which permit distinguishing 
trast, languages tike Italian and Japanese use syllables of pronunciations of common abbreviations, such as "doc- 
equal length (such languages may be called "syllable 55 tor" versus "drive" for "Dr." As described above, the 
timed")- phonetic transcriptions are associated with tags or 

As described above, the Prosody Generator 1100 groups of tags, rather than with the orthography, so 
reduces the string of stress numbers, phonemes, and that pronunciations can be discriminated by the gram- 
various extra stress and duration marks on a diphone- matical role of the respective word, 
by-diphone basis to a set of Diphone and Prosody Ar- 60 Table I below lists several representative dictionary 
rays 1110 of stress and duration information. It also adds entries, including tags and phonetic transcriptions, and 
intonation (pitch contour) and computes suitable ampli- Table II below lists the symbols used in the transcrip- 
tude and total duration based on arrays of stress and tions. In Table I, the notation "(TAG1 TAG2)" speci- 
syntactic duration information. fies a pair of tags acting as one tag as described in Appli- 

A Waveform Generator 1 120 takes the information in 65 cant's '966 patent. In addition to symbols for one stan- 
the Diphone and Prosody Arrays 1 1 10 and adds "coar- dard set of phonemes, the transcriptions advantageously 
ticulation", i.e., it runs words together as they are nor- include symbols for silence (#) and three classes of 
mally spoken without pauses except for grammatically non-transcriptions (? , *, and * ). 
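The association of transcriptions with tags rather than with orthography can be illustrated by a minimal sketch. The names `Pronunciation`, `WORD_DICTIONARY`, and `transcription_for` are hypothetical, the strings between the `#` delimiters are placeholders rather than transcriptions from Table I, and the assignment of NPPR to "doctor" (as in "Dr. Smith") and NPPO to "drive" (as in "Elm Dr.") is an assumption drawn from the tag names.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Pronunciation:
    tags: Tuple[str, ...]  # tag or group of tags that share this transcription
    transcription: str     # phonetic string, delimited by the silence symbol '#'


# Hypothetical entry: one orthography, two tag-specific pronunciations.
# The transcription strings are placeholders, not real phonetic symbols.
WORD_DICTIONARY: Dict[str, List[Pronunciation]] = {
    "dr.": [
        Pronunciation(("NPPR",), "#doctor#"),  # premodifier: "Dr. Smith"
        Pronunciation(("NPPO",), "#drive#"),   # post-modifier: "Elm Dr."
    ],
}


def transcription_for(word: str, tag: str) -> Optional[str]:
    """Return the transcription keyed to the given tag, or None if absent."""
    for pron in WORD_DICTIONARY.get(word.lower(), []):
        if tag in pron.tags:
            return pron.transcription
    return None
```

Once the parser has settled on a tag for the word, the matching pronunciation follows directly, which is the point of keying transcriptions to tags.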




The silence symbol is used to indicate a pause in pronunciation (see, for example, the entry "etc." in Table I) and also to delimit all transcriptions as shown in Table I. The ? symbol is used to indicate entries that need additional processing of the text to determine their pronunciation. Accordingly, the ? symbol is used primarily with numbers. In the dictionary look-up process, the digits 2-9 and 0 are mapped to a "2" for purposes of look up and the digit "1" is mapped to a "1". This reduces the number of distinct entries required to represent numbers. In addition, it is desirable (as described below with respect to the Phonetics Extractor 1080) to pronounce numbers in accordance with standard English (e.g., "one hundred forty-seven") rather than simply reading the names of the digits (e.g., "one four seven").

The * symbol is used to indicate a word for which special pronunciation rules may be needed. For example, in some educational products it is desirable to spell out certain incorrect forms (e.g., "ain't"), rather than give them apparently acceptable status by pronouncing them. The * symbol is used as the transcription for punctuation marks that may affect prosody but do not have a phonetic pronunciation. Also, provisions for using triphones may be included, depending on memory limitations, because their use can help produce high-quality speech; phonetic transcription symbols for three triphones are included in Table II.

Choice of Phonemes

Lists of English phonemes are available from a variety of sources, e.g., D. O'Shaughnessy, Speech Communication, p. 45, Addison-Wesley (1987) (hereinafter "O'Shaughnessy"). Most lists include about forty phonemes. The list in Table II below differs from most standard lists in having two unstressed allophones, ")" and "Y", of stressed vowels and in having a larger number of variants of liquids. In Table II, "R", "x", and "L" are standard, but "r", "X", and "l" are added allophones. Also, Table II includes the additional stops "D" and "T" for mid-word allophones of those phonemes.

It will be appreciated that the number of phonemes that should be used depends on the dialect to be produced by the TTS system. For example, some dialects of American English include the sound "O" shown in Table II and others use "a" in its place. The "O" might not be used when the TTS system implements a dialect that makes no distinction between "O" and "a". (The particular symbols selected for the Table are somewhat arbitrary, but were chosen merely to be easy to print in a variety of standard printer fonts.)

Table II also indicates three consonant clusters that have been used to implement triphones. In the interest of saving memory, however, it is possible to dispense with the consonant clusters.

Basic Look-up Scheme

The process implemented by the Dictionary Look-up Module 1010 for retrieving information from the Word Dictionary 1020 is preferably a variant of the Phrase Parsing process described in the '966 patent in connection with FIGS. 5a-5c. In the TTS system 1001, dictionary characters take the place of grammatical tags and dictionary tags take the place of non-terminals. The packing scheme for the Word Dictionary 1020 is similarly analogous to that given for phrases in the '966 patent. It will be appreciated that other packing schemes could be used, but this one is highly efficient in its use of memory.

Orthography-to-Phonetics (OTP) Conversion

When a word in the input text is not found in the Word Dictionary 1020, the TTS system 1001 either spells out the characters involved (e.g., for an input string "kot" the system could speak "kay oh tee") or attempts to deduce the pronunciation from the characters present in the input text. For deducing a pronunciation, a variety of techniques (e.g., that described in the above-cited patent to Lin et al.) could be used.

In Applicant's TTS system 1001, the Word Dictionary 1020 is augmented with tables of standard suffixes and prefixes, and the Dictionary Look-up Module 1010 produces deduced grammatical tags and pronunciations together. In particular, the suffix table contains both pronunciations for endings and the possible grammatical tags for each ending. The Dictionary Look-up Module 1010 preferably deduces the syntax tags for words not found in the Word Dictionary 1020 in the manner explained in the '966 patent.

The OTP process in the Dictionary Look-up Module 1010 implements the following steps to convert unknown input words to phoneme strings having stress marks.

1. Determine whether the word begins with "un" or "non". If so, the appropriate phonetic string for the prefix is output to a convenient temporary storage area, and the prefix is stripped from the orthography string in the Sentence Buffer 1060.

2. Determine whether the word ends in "ate". In English, this is a special case to track since the output in the temporary storage area will have two pronunciations with two tag sets in the form:

VB <root>2et

and

ADJ NABS <root>3Yt

For example, the word "estimate" has two pronunciations, as in "estimate the cost" and "a cost estimate". Other languages may have their own special cases that can be handled in a similar way. A flag is set indicating that all further expansion of the root pronunciation must expand both sections of the root.

3. Iteratively build up the pronunciation of the end of the word in the temporary storage area by comparing the end of the orthography in the Sentence Buffer 1060 to the suffix table, and, if a match is found:
a. stripping the suffix from the orthography;
b. outputting the appropriate phonetic transcription and syntax tags found in the suffix table; and
c. checking for the resulting root in the dictionary.
This continues until either no more suffixes can be found in the suffix table or the resulting root is found in the dictionary. The syntax tags included in the temporary storage area are only those retrieved from the suffix table for the first suffix stripped (i.e., the last suffix in the unknown word).

For example, for the input string "preconformingly" the first "ly" suffix is stripped (and the syntax tag for an adverb is retrieved), and then the "ing" suffix is stripped. The resulting root is "preconform", for which no more suffix stripping can be performed, and the phonetic transcription information so far is:

<beginning missing>3iN3Li

4. Iteratively build up the pronunciation of the beginning of the unknown word by matching the beginning to the entries in the prefix table, and, if a match is found:
a. outputting the pronunciation from the table;
b. stripping the prefix; and
c. checking for the resulting root in the dictionary.
This is done until no more prefixes can be found or until the resulting root appears in the dictionary. The resulting root for the foregoing example is "conform" as the remaining orthography. This root is in the dictionary, and the complete phonetics are:

3pRJ2kan1form3iN3Li

5. If step 4 fails, determine whether the remaining orthography consists of two roots in the dictionary (e.g., "desktop"), and if so concatenate the pronunciations of the two roots. Applicant's current OTP process divides the remaining orthography after the first character and determines whether the resulting two pieces are roots in the dictionary; if not, the remaining orthography is divided after the second character, and those pieces are examined. This procedure continues until roots have been found or all possible divisions have been checked.

6. If step 5 fails, proceed to convert whatever remains of the root via letter-to-sound rules, viz., attempt to generate a phonetic transcription for whatever remains according to very simple rules.

When processing is completed, the Dictionary Look-up Module 1010 transfers the syntax tags and phonetic transcriptions in the temporary storage area to the Syntactic Info and Transcriptions buffer 1030.

Entries in the suffix table specify orthography, pronunciation and grammatical tags, preferably in that order, and the following are typical entries in the suffix table:

orthography    pronunciation    tags
[illegible]    [illegible]      adj nabs
ified          3Y40d            adj vbd
ward           3wxd             adj

Entries in the prefix table contain orthography and pronunciations, and the following are typical entries:

orthography    pronunciation
[illegible]    3ar4kY
[illegible]    [illegible]
extra          3Eks4tR)

Structure of Output

Although many OTP processes could be used, the output of a suitable process, i.e., the syntactic information and phonetic transcriptions, must have a format that is identical to the format of the output from the Word Dictionary 1020, i.e., grammatical tags and phonetic transcriptions.

The Syntactic Information and Transcriptions 1030 is passed to the Grammar Modules 1040 and has a basic format as shown in FIG. 2 comprising one or more grammatical tags 1-N, a Skip Tag, and pointers to a phonetic transcription and a location in the input text. This structure permits associating a pronunciation (corresponding to the phonetic transcription) with a syntax tag or group of tags. It also keys the tag(s) and transcription back to a location (the end of a word) in the input text in the Sentence Buffer 1060. The key back to the text is particularly important for transcriptions using the "?" symbol because later processes examine the text to determine the correct pronunciation.

Since multiple pronunciations may be associated with different syntax tags for a single word (and indeed as explained in the '966 patent each word may have multiple syntax tags), the structure shown in FIG. 2 preferably includes one bit in a delimiter tag called the "Skip Tag" (because the process must skip a fixed number of bytes to find the next tag) that indicates if this group of tags is the end of the list of all tags for a given word. This is illustrated in FIG. 3, which shows the relationship between the word "record" in an input text fragment "The record was", the Syntactic Info & Transcriptions 1030, and a segment of the Word Dictionary 1020. When a word was not found in the dictionary but had its transcription deduced by the OTP process, the transcription pointer points to the transcription written by the OTP module in a convenient place in memory, rather than pointing into the dictionary proper.

[...] as a step in converting them into word forms. Such expansion is intentionally left [...] Applicant's invention because numbers are easier to parse grammatically without all the extra words; they become a single tag in Applicant's TTS system, rather than multiple tags that themselves require elaborate parsing.

Modifications for Other Languages

The first step in adapting the TTS system 1001 to another language is obtaining a suitable Word Dictionary 1020 having grammatical tags and phonetic transcriptions associated with the entries as described above. The tag system would probably differ from the English examples described here since the rules of grammar and types of parts of speech would probably be different. The phonetic transcriptions would also probably involve a different set of symbols since phonetics also typically differ between languages.

The OTP process would also probably differ depending on the language. In some cases (like Chinese), it may not be possible to deduce pronunciations from orthography. In other cases (like Spanish), phonetics may be so closely related to orthography that most dictionary entries would only contain a transcription symbol indicating that OTP can be used. This would save memory. On the other hand, it is believed that the dictionary look-up process and output format described here would remain substantially the same for all languages.

GRAMMAR LOOK-UP MODULES

The Grammar Look-up Modules 1040 operate substantially [...] those in the '966 patent, which pointed out locations in the input [...] followed [...] processing. As described [...], in the TTS system 1001 transcription pointers and locations are [...]. It [...] noted [...] the pointers and locations are not directly connected to the nonterminals, but their relationships can be deduced from other information. For example, the text locations of nonterminals and phonetic transcriptions are known, therefore the relationship between non-terminals and transcriptions can be derived whenever needed.

Structure of Grammar Tables

The Grammar Tables 1050 for the Phrase Dictionary, Phrase Combining Rules, and Sentence Checking Rules described in the '966 patent in connection with FIG. 3a, blocks 50-1, 50-2, and 50-3, are unchanged except for additions in the Phrase Dictionary to handle the proper-noun pre- and post-modifier tags described above.

Structure of Synth Log

The functions and characteristics of the Grammar Path Data Area 70 shown in FIGS. 1 and 3a of the '966 patent are effectively duplicated in the TTS system
1001 by a Grammar and Synth Log Data Area 1070 shown in FIG. 1. For the TTS system, the Grammar Log is augmented with a Synth Log, which includes one synth path for each grammar path. The structure of a grammar path is shown in FIG. 3b of the '966 patent. The structure of a corresponding synth path in the Synth Log is shown in FIG. 4. It is simply two arrays: one of the pointers Trans1 - TransN to transcriptions needed in a sentence, and the other of the pointers Loc1 - LocN to locations in the input text corresponding to each transcription. The synth path also contains a bookkeeping byte to track the number of entries in the two arrays.

Grammar Module Processing

The processes implemented by the Grammar Modules 1040 are modified from those described in the '966 patent such that, as each tag is taken from the Syntactic Info and Transcriptions area 1030, if it can be used successfully in a path, its transcription pointer and location are added to the appropriate arrays on the corresponding synth path in the Synth Log. At the same time, the number-of-entries byte on that synth path is updated. In effect, the process implemented by the Grammar Modules 1040 does nothing with the phonetic transcriptions/locations but track which ones are used [...].

Modifications for Other Languages

Besides any appropriate changes to the tagging system as described above, it is necessary to have a grammar for the other language. The '966 patent gives the necessary information for a competent linguist to prepare a grammar with the aid of a native speaker of the other language.

PHONETICS EXTRACTOR

Processing Scheme

In the TTS system, the "best" grammar path in the Grammar and Synth Path Log 1070 is selected by the Phonetics Extractor 1080 for further processing. This is determined by evaluating the function 4*PthErr+NestDepth for each grammar path and selecting the path with the minimum value of this function, as described in Applicant's '966 patent in connection with FIGS. 3b, 7a, and 7c, among other places. The variable PthErr represents the number of grammatical errors on the path, and the variable NestDepth represents the maximum depth of nesting used during the parse.

If two paths have the same "best" value, one is chosen arbitrarily, e.g., the first one in the grammar path log to have the best score. If all grammar paths disappear, i.e., a total parsing failure, then the path log as it existed immediately prior to failure is examined according to the same procedure. The parsing process resumes immediately after the point of failure after the portion preceding the failure has been synthesized.

The transcription pointers and locations for the identified path are copied by the Phonetics Extractor 1080 to a synth_pointer_buffer which has a format as shown in FIG. 5. In the figure, TransPtrn is a transcription pointer, Locn is a location in the input text, and SIn is a syntax_info byte. To flag the end of the list of transcriptions, a final entry having the transcription pointer and location both equal to zero is added.

The syntax_info byte added to each transcription/location pair is determined by examining the selected path in the grammar path log which contains information on nesting depth and non-terminals keyed to locations in the input text. For the final entry in the list, the syntax_info byte is set to "1" if this is the end of a sentence, and hence additional silence is required (and added) in the output to separate sentences. The syntax_info byte is set to "0" if this transcription string represents only a portion of a sentence (as would happen in the event of a total parsing failure in the middle of a sentence) and should not have silence added.

The syntax_info byte would be set to a predetermined value, e.g., Hex80, for a word needing extra stress, e.g., the last word in a noun phrase. It will be appreciated that extra stress could be added to all nouns. This results in various changes in prosodic style. The TTS system need not add extra stress to any words, but a mechanism for adding such additional stress is to stress words according to their grammatical roles.

If the grammar path log indicates that the nesting level changes immediately prior to a particular word, then the syntax_info byte would be set to another predetermined value, e.g., Hex40, if the sentence nested deeper at that point; the syntax_info byte would be set to other predetermined values, e.g., Hex01, Hex02, or Hex03, if the sentence unnested by one, two, or three levels, respectively.

The synth_pointer_buffer is then examined by the Phonetics Extractor one transcription pointer at a time. If the transcription pointed to is a "?", then a transcription must be deduced for a numeric string in the input text. If the transcription pointed to is a "*", then the word in the input text is to be spelled out. If the transcription pointed to is a *, then the input text contains a punctuation mark that requires prosodic interpretation. If the syntax_info byte is non-zero, stress ("[") or destress ("]") markers must be added to the output string produced by the Phonetics Extractor and stored in the Phonetic String with Prosody Markers 1090. Otherwise, the transcription retrieved from the Word Dictionary 1020 can be copied into the Phonetic String 1090 directly. Certain other words, e.g., "a" and "the", in the input text can trigger the following special cases: the pronunciation of the word "the" is changed from the default "q&&" to "qii" before words beginning with vowels; and the pronunciation of the word "a" is changed from the default "&&" to "ee" if the character in the input text is uppercase ("A").

Other special cases involve the grammar stress markers mentioned above. If the syntax_info byte is Hex80, the transcription is bracketed with the characters "[. . .]" to indicate more stress on that word. If the syntax_info byte is Hex40, a destress marker ("]") is placed before the word's transcription in the Phonetic String 1090. If the syntax_info byte is in the range 1 to 3, that number of stress markers is placed before the word, i.e., "[", "[[", or "[[[".

In the special case of input numeric character strings, which might be simple numbers, dates, or currency amounts, single digit numbers are looked up in a table of digit pronunciations, e.g., "2" -> 1tuu##. Two-digit numbers are translated to "teens" if they begin with "1"; otherwise they are translated to a "ty" followed by a single digit, e.g., "37" -> 1Qx2ti##1sE2vYn##. Three-digit blocks are translated to digit-hundred(s) followed by analysis of the final two digits as above. Four-digit blocks can be treated as two two-digit blocks, e.g., 1984 -> 1nJn2tiin##1ee2ti##1for##. In large numbers separated by commas, e.g., 1,247,361, each block of digits may be handled as above and words such as "million" and "thousand" inserted to replace the appropriate commas.

In the special case of input numeric text preceded by a dollar sign, e.g., $xx.yy, the number string xx can be converted as above, then the pronunciation of "dollar" or "dollars" can be added followed by "and", then the yy string can be converted as above and the pronunciation of "cents" added to the end.
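The numeric special cases can be illustrated with a minimal sketch that emits English words in place of phonetic transcriptions. The function names and word tables are hypothetical. The handling of leading zeros, of two-digit values ending in zero, and of all-zero comma groups is an assumption, since the text does not address those cases. Four-digit strings are always read as two two-digit blocks, so a value such as "2000" is not handled gracefully here.

```python
import re
from typing import List

ONES = ["zero", "one", "two", "three", "four",
        "five", "six", "seven", "eight", "nine"]
TEENS = ["ten", "eleven", "twelve", "thirteen", "fourteen",
         "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = {2: "twenty", 3: "thirty", 4: "forty", 5: "fifty",
        6: "sixty", 7: "seventy", 8: "eighty", 9: "ninety"}
SCALES = ["", "thousand", "million", "billion"]


def two_digits(d: str) -> List[str]:
    """'Teens' if the pair begins with 1, otherwise a 'ty' word plus a digit."""
    tens, ones = int(d[0]), int(d[1])
    if tens == 0:                      # assumption: leading zero is dropped
        return [ONES[ones]]
    if tens == 1:
        return [TEENS[ones]]
    return [TENS[tens]] + ([ONES[ones]] if ones else [])


def block(d: str) -> List[str]:
    """Convert a block of one to three digits."""
    if len(d) == 1:
        return [ONES[int(d)]]
    if len(d) == 2:
        return two_digits(d)
    words = [ONES[int(d[0])], "hundred"] if d[0] != "0" else []
    if d[1:] != "00" or not words:
        words += two_digits(d[1:])
    return words


def number_words(s: str) -> List[str]:
    """Convert a digit string, optionally grouped by commas, to words."""
    if "," in s:
        groups = s.split(",")
        if len(groups) > len(SCALES):
            raise ValueError("number too large for this sketch")
        words: List[str] = []
        for i, g in enumerate(groups):
            if int(g) == 0:            # assumption: skip all-zero groups
                continue
            words += block(g)
            scale = SCALES[len(groups) - 1 - i]
            if scale:
                words.append(scale)    # scale word replaces the comma
        return words or ["zero"]
    if len(s) == 4:                    # two two-digit blocks, e.g. 1984
        return two_digits(s[:2]) + two_digits(s[2:])
    if 1 <= len(s) <= 3:
        return block(s)
    raise ValueError("unsupported numeric string")


def dollar_words(text: str) -> List[str]:
    """Convert '$xx.yy' to 'xx dollar(s) and yy cents'."""
    m = re.fullmatch(r"\$(\d[\d,]*)\.(\d\d)", text)
    if m is None:
        raise ValueError("expected a string of the form $xx.yy")
    whole = number_words(m.group(1))
    unit = "dollar" if whole == ["one"] else "dollars"
    return whole + [unit, "and"] + two_digits(m.group(2)) + ["cents"]
```

For instance, `" ".join(number_words("1984"))` gives "nineteen eighty four", matching the two-block reading described above. A real system would emit the stored transcription for each word rather than its spelling.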

Spell-outs are another special case, i.e., transcriptions realized by * indicate that the input word must be spelled out character by character. Accordingly, the Phonetics Extractor 1080 preferably accesses a special table storing pronunciations for the letters of the alphabet and common punctuation marks. It also accesses the table of digit pronunciations if necessary. Thus, input text that looks like "A1-B4" is translated to: ##1ee##1w&n##1hJ2fYn##1bu##1for##. Note that extra pause (silence) markers are added between the words to force careful and exaggerated pronunciations, which is normally desirable for text in this form.
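A minimal sketch of the spell-out case follows. The tables and the function name `spell_out` are hypothetical, ordinary English letter names stand in for the stored phonetic transcriptions, and silently skipping characters absent from the tables is an assumption.

```python
import string

# Placeholder "pronunciations": plain English names, not phonetic symbols.
LETTER_NAMES = dict(zip(string.ascii_lowercase, [
    "ay", "bee", "see", "dee", "ee", "ef", "jee", "aitch", "eye", "jay",
    "kay", "el", "em", "en", "oh", "pee", "cue", "ar", "ess", "tee",
    "you", "vee", "double-you", "ex", "why", "zee"]))
DIGIT_NAMES = dict(zip("0123456789", [
    "zero", "one", "two", "three", "four",
    "five", "six", "seven", "eight", "nine"]))
PUNCT_NAMES = {"-": "hyphen", ".": "period", ",": "comma"}


def spell_out(text: str) -> str:
    """Spell text character by character, with '##' pauses between names."""
    names = []
    for ch in text.lower():
        name = LETTER_NAMES.get(ch) or DIGIT_NAMES.get(ch) or PUNCT_NAMES.get(ch)
        if name is not None:           # assumption: unknown characters are skipped
            names.append(name)
    return "##" + "".join(name + "##" for name in names)
```

With these placeholder tables, `spell_out("A1-B4")` yields `##ay##one##hyphen##bee##four##`, which has the same delimiter structure as the transcription shown above.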

In the TTS system 1001, punctuation marks have a " - " transcription. The various English punctuation marks are conveniently translated as follows, although other translations are possible:

punctuation mark     translation
period               + [illegible] sec of silence
exclamation mark     ## + [illegible] sec of silence
question mark        ## + [illegible] sec of silence
& (ampersand)        3And#
colon                ####
semi-colon           ## + [illegible] sec of silence
comma                #|#
left paren           #####
right paren          ##
quote mark           ##
double hyphen        ###




Structure of Output

The Phonetic String with Prosody Markers 1090 generated by the Phonetics Extractor 1080 contains phonetic spellings with lexical stress levels, square bracket characters indicating grammatical changes in stress level, and the character "|" indicating punctuation that may need further interpretation. The string advantageously begins with the characters "xx" for a declaration, "xx?" for a question, and "xx!" for an exclamation. Such characters are added to the phonetics string based on the punctuation marks in the input sentence. The notation "xx?" is added to force question intonation later in the process; it is added only if the input text ends with a question mark and the question does not begin with "how" or a word starting with "wh". For example, if the input text is "Cats who eat fish don't get very fat" the Phonetic String 1090 is the following:

xx #1kAts#]2hu#1it#1fiS#[1dont#1gEt#2ve3ri1fAt##

It will be noted that the segment "]2hu#1it#1fiS#[" (underlined in the original) is destressed because it is a subordinate clause that is identified in the parsing process.

Modifications for Other Languages

The rules for special cases (currency, numbers, etc.) and insertion of grammar marks for prosody will differ from language to language, but can be implemented in a manner similar to that described above for English.

PROSODY GENERATOR

The Prosody Generator 1100 first modifies the Phonetic String with Prosody Marks 1090 to include additional marks for stress and duration (rhythm). Then it converts the result to a set of Diphone and Prosody Arrays 1110 on a diphone-by-diphone basis, specifying each successive diphone to be used and associated values of pitch, amplitude, and duration.

Rhythm and Stress Processing

The Prosody Generator 1100 examines the information in the Phonetic String 1090 syllable by syllable (each syllable is readily identified by its beginning digit that indicates its lexical stress level). If a low-stress syllable (e.g., stress level 2, 3, or 4) is followed by a high-stress syllable (stress level 1 or 2) and the difference in stress levels is two or more, the stress on the low-stress syllable is made one unit stronger.

If a syllable is followed by the punctuation marker "|" (or a silence interval followed by "|") or if it is at the end of the string, the syllable is enclosed in curly braces ({. . .}) and the entire word in which that syllable appears is enclosed in curly braces. The curly braces, or other suitable markers, are used to force extra duration (lengthening) of the syllables they enclose. In addition, the next word after a "|" mark is enclosed in square brackets ([. . .]) to give it extra stress.

The Prosody Generator 1100 makes rhythm adjustments to the Phonetic String 1090 by examining the patterns of lexical stress on successive syllables. Based on the number of syllables having stress levels 2, 3, or 4 that fall between succeeding syllables having stress level 1, the Prosody Generator brackets the leading stress-level-1 syllable and the intervening syllables with curly braces to increase or decrease their duration. The following bracketing scheme is currently preferred for English, but it will be appreciated that other schemes are suitable:



syllable stress pattern    number of low-stress syllables    bracketing pattern
1 1                        0                                 {{1}} 1
1 a 1                      1                                 {1} {{a}} 1
1 a b 1                    2                                 1 {{a}} {{b}} 1
1 a b c 1                  3                                 [illegible]
1 a b ... x 1              >= 4                              1 }}a{{ }}b{{ ... }}x{{ 1
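The stress-promotion rule stated in the Rhythm and Stress Processing discussion (a low-stress syllable followed by a high-stress syllable that is at least two levels stronger is made one unit stronger) can be sketched as follows. The function name and the list-of-integers representation are hypothetical. Evaluating each syllable against the original, unmodified levels is an assumption, since the text does not say whether promotions cascade.

```python
from typing import List


def promote_weak_syllables(levels: List[int]) -> List[int]:
    """Apply the one-unit promotion to lexical stress levels (1 = strongest)."""
    out = list(levels)
    for i in range(len(levels) - 1):
        cur, nxt = levels[i], levels[i + 1]
        # Low-stress syllable (2, 3, or 4) before a high-stress syllable
        # (1 or 2), with a difference of two or more levels.
        if cur in (2, 3, 4) and nxt in (1, 2) and cur - nxt >= 2:
            out[i] = cur - 1           # one unit stronger means a smaller number
    return out
```

Under these assumptions the sequence 4, 1, 3, 1 becomes 3, 1, 2, 1, while 2, 1 is left alone because the difference is only one level.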



For handling the stress level patterns at the beginning and end of a sentence, the Prosody Generator assumes that stress-level-1 syllables precede and follow the sentence. Finally, the Prosody Generator 1100 strips the vertical bars "|" from the phonetic string and processes the result according to a Diphone-Based Prosody process.
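The rhythm adjustment depends on how many low-stress syllables fall between successive stress-level-1 syllables, with stress-level-1 syllables assumed to precede and follow the sentence. A minimal sketch of that count is given here. The function name and the list representation are hypothetical, and choosing the bracketing pattern for each count is left out.

```python
from typing import List


def low_stress_runs(levels: List[int]) -> List[int]:
    """Count low-stress syllables between successive stress-level-1 syllables.

    A virtual stress-level-1 syllable is assumed before and after the
    sentence, so the first and last counts cover its edges.
    """
    padded = [1] + list(levels) + [1]
    runs: List[int] = []
    count = 0
    for level in padded[1:]:           # start just after the leading virtual 1
        if level == 1:
            runs.append(count)         # close the run at each stress-1 syllable
            count = 0
        else:
            count += 1
    return runs
```

For the stress sequence 2, 1, 3, 3, 1, 2 the function returns the counts 1, 2, 1: one low-stress syllable before the first stressed syllable, two between the stressed syllables, and one after the last.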

Diphone-Based Prosody 

If the phonetic string begins with "xx" or "xx!" characters, the Diphone-Based Prosody process in the Prosody Generator 1100 sets an intonation_mode flag for a declaration; if the Generator 1100 determines that the string begins with "xx?" the intonation_mode flag is set for a question. Also, if the string begins with "xx!" a pitch-variation-per-stress level (pvpsl) is set to a predetermined value, e.g., 25%; otherwise, the pvpsl is set to another predetermined value, e.g., 12%. The purpose of the pvpsl is described further below.
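The sentence-type prefix can be sketched from both sides: choosing it from the input text and interpreting it in the Diphone-Based Prosody process. The function names are hypothetical. Falling back to "xx" for "how" and "wh" questions is an assumption, since the text says only that "xx?" is not used for them. The 25% and 12% values are the example values given above.

```python
from typing import Tuple


def sentence_prefix(text: str) -> str:
    """Choose the prefix placed at the head of the phonetic string."""
    stripped = text.strip()
    if stripped.endswith("!"):
        return "xx!"
    if stripped.endswith("?"):
        first_word = stripped.split()[0].lower()
        # Questions beginning with "how" or a "wh" word do not get "xx?".
        if first_word.startswith("how") or first_word.startswith("wh"):
            return "xx"                # assumption: treated as a declaration
        return "xx?"
    return "xx"


def intonation_settings(pstr: str) -> Tuple[str, float]:
    """Return (intonation_mode, pvpsl) for a phonetic string."""
    # "xx?" must be tested first because every prefix begins with "xx".
    mode = "question" if pstr.startswith("xx?") else "declaration"
    pvpsl = 0.25 if pstr.startswith("xx!") else 0.12
    return mode, pvpsl
```

With these definitions, "Is it raining?" maps to "xx?" and question mode, while "Why is it raining?" maps to "xx" and is intoned as a declaration.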

FIG. 6 shows the structure of the Prosody Arrays 
1110 generated by the Diphone-Based Prosody process 

65 in the Prosody Generator 1 100. The first arrays created 
from the Phonetic String are as follows: an array DN 
contains diphone numbers; an array LS contains the 
lexical stress for each diphone; an array SS contains the 



02/23/2004, EAST Version: 1.4.1 



5,384,893 

17 18 

syntactic stress for each diphone; and an array SD contains the
syntactic duration for each diphone. The other arrays shown in FIG. 6
are described below.

FIG. 7A shows the Prosody Generator process that converts a phonetic
string pstr in the Phonetic String with Prosody Markers 1090 into the
arrays DN, LS, SS, and SD. Using an index n into the string that has a
maximum value len(pstr), the process proceeds through the string,
modifying a variable SStr representing current Syntactic Stress using
the stress marks, modifying a variable SDur representing syntactic
duration using the duration marks, and setting a current value LStr of
lexical stress for all diphones in a syllable using the lexical stress
marks. As seen in the figure, the process also accesses a program
module called DiphoneNumber to determine the number assigned to each
character pair that is a diphone.

FIG. 7B is a flowchart of the DiphoneNumber module. The arguments to
the DiphoneNumber module are a string of phonetic characters pstr and
an index n into that string. The module finds the first two characters
in pstr at or after the index that are valid phonetic characters (i.e.,
characters that are not brackets [, ] or braces {, }, or digits, in
this embodiment). It then searches the list of diphones to determine if
the pair is in the diphone inventory and if so it returns the diphone
number assigned in the list to that diphone. If the pair is not a
diphone, the routine returns a negative number as an error indicator.

The list of diphones employed by Applicant's current TTS system is
given in Table IV below, and is stored in a convenient place in memory
as part of the Diphone Waveforms 1130. The search of the diphone
inventory is rendered more efficient by preparing, before the first
search, a table stdip giving the starting location for all diphones
having a given first character. A flowchart of the process for
constructing the stdip table is shown in FIG. 7C. Referring again to
FIG. 7A, if a character pair ab is not a diphone, the Prosody Generator
replaces this pair with the two pairs a# and #b.

As a result of this process, the first diphone in each word (which is
of the form #a) carries stress and duration values from the preceding
word. Accordingly, as shown in FIG. 7A, a program module makes another
pass through the arrays to pull the stress and duration values forward
by one location for diphones beginning with the symbol #. A flowchart
of the pull-stress-forward module is shown in FIG. 7D.

The Prosody Generator 1100 also finds the minimum minSD of the entries
in the SD array, and if minSD is greater than zero, it normalizes the
contents of the SD array according to the following equation:

SD[j] = SD[j] - minSD.

The other arrays shown in FIG. 6 are a total stress array TS, an
amplitude array AM, a duration factor array DF, a first pitch array P1,
and a second pitch array P2, which are generated by the Prosody
Generator 1100 from the DN, LS, SS, and SD arrays.

The total stress array TS is generated according to the following
equation:

[equation not legible in the source]

The Prosody Generator also finds the minimum minTS of the contents of
the TS array, and normalizes those contents according to the following
equation:

TS[j] = TS[j] - minTS + 1.

The Prosody Generator 1100 generates the amplitude array AM as a
function of the total stress in the TS array according to the following
equation:

[equation not legible in the source]

This results in an amplitude value of sixteen for the most highly
stressed diphones and (typically) four for the least stressed diphones.
The interpretation of these values is discussed below.

A duration factor that takes into account the desired speaking rate,
syntactically imposed variations in duration, and variations due to
lexical stress is determined by the Prosody Generator from the
following relationship:

DF[j] = Dur + SD[j] + 4 - LS[j]

where Dur is a duration value (typically ranging from zero to
[value illegible]) that establishes the overall speaking rate. It is
currently preferred that the Prosody Generator clamp the value DF[j] to
the range zero to sixteen.

The final values stored in the first pitch array P1 and second pitch
array P2 are generated by the Prosody Generator 1100 based on four
components: sentential effects, stress effects, syllabic effects, and
word effects. The values in the P1 array represent pitch midway through
each diphone and the values in the P2 array represent pitch at the ends
of the diphones.

The Prosody Generator 1100 handles sentential pitch effects by
computing a baseline pitch value for each diphone based on the
intonation mode. For declarations, the baseline pitch values are
advantageously assigned by straight-line interpolation from an initial
reference value at the first diphone to a value at the last diphone
about 9% lower than the initial value. For questions, a suitable
baseline for computing pitch values is shown in FIG. 8. The baseline
reflects the typical form of English questions, in which pitch drops
below a reference level on the first word and rises above the reference
level on the last word of the sentence. It will be appreciated that
baselines other than straight-line interpolation or FIG. 8 would be
used for languages other than English.

For stress effects, the Prosody Generator first initializes the two
pitch values P1[j] and P2[j] for each diphone to the same value, which
is given by the following equation:

baseline * pmod(pvpsl, TS[j])

where pvpsl is the pitch-variation-per-stress level described above and
TS[j] is the total stress for the diphone j. The baseline value is as
described above, and the function pmod(pvpsl, TS[j]) is given by the
following table:

TS[j]     pmod
1         1 + 2*pvpsl
2         1 + pvpsl
3         [illegible]
≥4        1 - pvpsl

For syllabic effects, the Prosody Generator resets P1[j] to the
baseline value if the diphone begins with
silence or an unvoiced phoneme, and resets P2[j] to the baseline value
if the diphone ends with silence or an unvoiced phoneme.

Finally, for word effects, the Prosody Generator decreases both P1[j]
and P2[j] by an amount proportional to their distance into the current
word such that the total drop in pitch across a word is typically 40
hertz (Hz) (in particular, 8 Hz per diphone if there are fewer than
five diphones in the word). For the final word in the sentence, the
drop is typically 16 Hz per diphone, but is constrained to be no
greater than 68 Hz for the whole word.

Modifications for Other Languages

The Prosody Generator of Applicant's current TTS system could be
modified to reflect the requirements of other languages. The rhythm
adjustments based on syllable stress are needed only in some languages
(e.g., English and German). Other languages (e.g., Spanish and
Japanese) have all syllables of equal length; thus, a Prosody Generator
for those languages would be simpler. The adjustments to duration and
stress around punctuation marks and at the end of utterances are
probably language-dependent, as are the relationships between
amplitude, duration, pitch, and stress levels. For example, in Russian
pitch decreases when lexical stress increases, which is the opposite of
English.

WAVEFORM GENERATOR

In general, the Waveform Generator converts the information in the
Diphone and Prosody Arrays into a digital waveform that is suitable for
conversion to audible speech by a diphone-by-diphone process. The
Waveform Generator also preferably implements a process for
"coarticulation", by which gaps between words are eliminated in speech
spoken at moderate to fast rates. Retaining the gaps can result in an
annoying staccato effect, although for some applications, especially
those in which very slow and carefully articulated speech is preferred,
the TTS system can maintain those gaps.

The coarticulation process might have been placed in the Prosody
Generator, but including it in the Waveform Generator is advantageous
because only one procedure call (at the start of waveform generation)
is needed rather than two calls (one for question intonation and one
for declaration intonation). Thus, the coarticulation process is, in
effect, a "cleanup" mechanism pasted onto the end of the prosody
generation process.

FIG. 9 is a flowchart of the coarticulation process, which generates a
number of arrays that are assumed to have N diphones, numbered 0 to
N-1. The predefined arrays FirstChar[] and SecondChar[] contain the
first and second characters, respectively, ordered by diphone number.

Using the process shown, the Waveform Generator 1120 removes instances
of single silences (#) separating phonemes, appropriately smooths
parameters, and closes up the Diphone and Prosody Arrays. If a sequence
/a# #b/ provided to the coarticulation process cannot be reduced to a
sequence /ab/ because the /ab/ sequence is not in the diphone
inventory, then the sequence /a# #b/ is allowed to stand. Also, if /a/
and /b/ are the same phoneme, then the /a# #b/ sequence is not modified
by the Waveform Generator.

After the coarticulation process, the Waveform Generator proceeds to
develop the digital speech output waveform by a diphone-by-diphone
process. Linear predictive coding (LPC) or formant synthesis could be
used to produce the waveform from the closed-up Diphone and Prosody
Arrays, but a time-domain process is currently preferred to provide
high speech quality with low computational power requirements. On the
other hand, this time-domain process incurs a substantial cost in
memory. For example, storage of high quality diphones for Applicant's
preferred process requires approximately 1.2 megabytes of memory, and
storage of diphones compressed by the simple compression techniques
described below requires about 600 kilobytes.

It will be appreciated that Applicant's TTS system could use either a
time-domain process or a frequency-domain process, such as LPC or
formant synthesis. Techniques for LPC synthesis are described in
chapters 8 and 9 of O'Shaughnessy, which are hereby incorporated in
this application by reference, and in U.S. Pat. No. 4,624,012 to Lin et
al. Techniques for formant synthesis are described in the
above-incorporated chapter 9 of O'Shaughnessy, in D. Klatt, "Software
for a Cascade/Parallel Formant Synthesizer", J. Acoust. Soc. of Amer.,
vol. 67, pp. 971-994 (March, 1980), and in the Malsheen et al. patent.
With similar memory capacity available and substantially more
processing power, an LPC-based waveform generator could be implemented
that could provide better speech quality in some respects than does the
time-domain process. Moreover, an LPC-based waveform generator would
certainly offer additional flexibility in modifying voice
characteristics, as described in the Lin et al. patent.

Most prior synthesizers use a representation of the basic sound unit
(phoneme or diphone) in which the raw sound segment has been processed
to decompose it into a set of parameters describing the vocal tract
plus separate parameters for pitch and amplitude. The parameters
describing the vocal tract are either LPC parameters or formant
parameters. Applicant's TTS system uses a time-domain representation
that requires less processing power than LPC or formants, both of which
require complex digital filters.

Lower quality time-domain processes can also be implemented (e.g., any
of those described in the above-cited Jacks et al. patent and U.S. Pat.
Nos. 4,833,718 and 4,852,168 to Sprague). Such processes require
substantially less memory than Applicant's approach and result in other
significant differences in the waveform generation process.

About Diphones

Diphones (augmented by some triphones, as described above) are
Applicant's preferred basic unit of synthesis because their use results
in manageable memory requirements on current personal computers while
providing much higher quality than can be achieved by phoneme- or
allophone-based synthesis. The higher quality results because the rules
for joining dissimilar sounds (e.g., by interpolation) must be very
complex to produce natural sounding speech, as described in
O'Shaughnessy at pp. 382-385.

An important feature of Applicant's implementation is the storage of
diphones having non-uniform lengths, which is in marked contrast to
other TTS systems. In Applicant's system, the diphones' durations are
adjusted to correspond to length differences in vowels that result from
the natures of the vowels themselves or the contexts in which they
appear. For example, vowels tend to be shorter before unvoiced
consonants and longer before voiced consonants. Also, tense vowels
(e.g., /i/, /u/, /e/) tend to be longer than lax vowels (e.g., /I/,
/&/, /E/). These tendencies are explained in detail in
many books on phonetics and prosody, such as I. Lehiste,
Suprasegmentals, pp. 18-30, MIT Press (1970).

The duration adjustments in Applicant's implementation are needed
primarily for syntactically induced changes and to vary the speaking
rate by uniform adjustment of all words in a sentence, not on a
phoneme-by-phoneme basis to account for phonetic context. Phoneme- and
allophone-based systems either must implement special rules to make
these adjustments (e.g., the system described in the Malsheen et al.
patent) or ignore these differences (e.g., the system described in the
Jacks et al. patent) at the cost of reduced speech quality.

Structure of Stored Diphones

The currently preferred list of the stored diphones used for an English
TTS system is given in Table IV below. In accordance with one aspect of
Applicant's invention, the diphones are represented by signals having a
data sampling rate of 11 kilohertz (KHz) because that rate is something
of a standard on PC platforms and preserves all the phonemes from both
males and females reasonably well. It will be appreciated that other
sampling rates can be used; for example, if the synthetic speech is
intended to be played only via telephone lines, it can be sampled at 8
KHz (which is the standard for telephone transmission). Such
down-sampling to 8 KHz would save memory and result in no loss of
perceived quality at the receiver beyond that normally induced by
telephonic transmission of speech.

The diphones are stored in the Diphone Waveforms 1130 with each sample
being represented by an eight-bit byte in a standard (mu-law) companded
format. This format provides roughly twelve-bit linear quality in a
more compact format. The diphone waveforms could be stored as eight-bit
linear (with a slight loss of quality in some applications) or
twelve-bit linear (with a slight increase in quality and a substantial
increase in memory required).

One option in the current TTS system is the compression of the diphone
waveforms to reduce the memory capacity required. Simple adaptive
differential pulse code modulation (ADPCM) can reduce the memory
required for waveform storage by roughly a factor of two. Applicant's
current TTS system implements ADPCM (as described, for example, in
O'Shaughnessy at pp. 273-274, which are hereby incorporated in this
application by reference) applied directly to the companded signal with
a four-bit quantizer and adapting the step size only. It will be noted
that this reduces the quality of the output speech, and since
Applicant's current emphasis is on speech quality, further data
reduction schemes in this area have not been implemented. It will be
appreciated that many compression techniques are well known both for
time-domain systems (see, e.g., the above-cited U.S. Patents to
Sprague) and LPC systems (see O'Shaughnessy at pp. 358-375).

While it is currently preferred that the diphone waveforms be stored in
random access memory (RAM), the inventory could in various
circumstances be stored in read-only memory (ROM) or on another
fast-access medium (e.g., a hard disk or a flash memory card),
especially if RAM is very expensive in a given application but
alternate cheap mass storage is available. As described in more detail
below, the diphone waveforms are stored in three separate memory
regions called SAMP, MARK, and DP in the Diphone Waveforms area 1130.

The raw waveforms representing each diphone are stored consecutively in
memory in the area called SAMP. The MARK area contains a list of the
successive lengths of pitch intervals for each diphone. Voiced
intervals are given a positive length and unvoiced regions are
represented by an integer giving the negative of the length.

The array DP contains, for each entry, information giving the two
phoneme symbols in the diphone, the location in the SAMP area of the
waveform, the length in the SAMP area of the waveform, the location in
the MARK area of the pitch intervals (or marks), and the number of
pitch intervals in the MARK area. The diphones are stored in the DP
area in alphabetical order by their character names, and the DP area
thus constitutes the diphone inventory accessed by the Prosody
Generator 1100.

Certain diphones can uniformly substitute for other diphones, thus
reducing the amount of data stored in the SAMP and MARK areas. The
Waveform Generator performs such substitutions by making entries for
the second diphones in the DP array but using pointers in their blocks
in the DP array that point to descriptions in MARK and SAMP of some
existing diphones. The currently preferred substitutions for English
are listed in Table V below; in the Table, substitutions are indicated
by the symbol ">".

There are two classes of substitutions: those in which the substitution
results in only a minor reduction in speech quality (e.g., substituting
/[illegible]/ for /T/ in several cases); and those which result in no
quality difference (e.g., substituting /ga/ for /gJ/ does not reduce
speech quality because the /J/ diphone begins with an /a/ sound).

Diphone-by-Diphone Processing

The Waveform Generator produces the digital speech output of the TTS
system through a process that proceeds on a diphone-by-diphone basis
through the Diphone and Prosody Arrays 1110, constructing the segment
of the speech output corresponding to each diphone listed in the DN
array. In other words, for each index j in the array DN[], the raw
diphone described in the DP, MARK, and SAMP areas (at location DN[j] in
the DP area) is modified based on the information in AM[j], DF[j],
P1[j], and P2[j] to produce the output segment.

For each diphone j in the array DN[], the Waveform Generator performs
the following actions.

1. If the diphone was stored in a compressed format, it is
decompressed.

2. Three points in the pitch contour of the diphone are established: a
starting point, a mid point, and an end point. The starting point is
the end of the previous diphone's pitch contour, except on the first
diphone for which the start is set as P1[j]. The mid point is P1[j],
and the end point is P2[j]. The use of three pitch points allows convex
or concave shapes in the pitch contour for individual phonemes. In the
following, if a diphone consists of an unvoiced region followed by
voiced regions, only the pitch information from the mid to end points
is used. If it consists of voiced segments followed by an unvoiced
segment, only the information from the start to the mid point is used.
Otherwise, all three points are used. Interpolation between the points
is linear with each successive pitch interval.

3. An estimate of the number of pitch periods actually needed from this
diphone is made by dividing the length of all voiced intervals in the
stored diphone by an average pitch requested by the start, mid, and end
pitch values in voiced regions.

4. If more voiced intervals are needed than actually exist in the
stored diphone, the duration factor DF[j] is adjusted by the following
equation to force elongation of the diphone:

[equation not legible in the source]

5. The Waveform Generator then steps through the successive intervals
(specified in the MARK area) defining the diphone and does the
following for each interval:

a. For unvoiced intervals, the samples are copied, with modification
only to amplitude, to a storage area for the digital speech output
signal, except as noted below for very high rate speech.

b. For voiced intervals, the samples are copied, with adjustment for
both pitch and amplitude, to the output signal storage area. Duration
adjustments are then made by examining the duration factor DF[j] and a
predefined table (given in Table III) that gives drop/duplicate
patterns as a function of duration. The process steps horizontally
across the table on successive intervals. If duplicate is specified,
the interval samples are copied again. If drop is specified, counters
are incremented to skip the next interval in the stored diphone.
Finally, the pitch is interpolated linearly either between start and
mid points (for the first half of the diphone) or between mid and end
points.

The Interval Copying Process

The Waveform Generator adjusts amplitude for both voiced and unvoiced
intervals by additively combining the value of AM[j] with each
companded sample in the interval. In this adjustment, positive values
of AM[j] are added to positive samples and subtracted from negative
samples. Likewise, negative values of AM[j] are subtracted from
positive samples and added to negative samples. In both cases, if the
resulting sample has a different sign from the original sample, the
result is set to zero instead. This works because both the samples and
the AM values are encoded roughly logarithmically. If the diphones are
stored as linear waveforms, amplitude adjustment would proceed by
multiplication by suitably converted values of AM.

In copying voiced intervals, the desired interval length is given by:

1/(desired_pitch)

For voiced intervals, if the desired length is greater than the actual
length in the stored diphone interval, the available samples are padded
with samples having zero value [remainder of sentence illegible in the
source]. This is illustrated by FIGS. 10A-10B. FIG. 10A represents
three repetitions of a marked interval in an original stored diphone
waveform, and FIG. 10B represents padding of one of the original marked
intervals to obtain a raw signal with desired lower pitch.

If the desired length is less than the actual length, the number of
original samples falling in the desired length are taken for this
interval; the remaining original samples are not discarded but are
added into the beginning of the next interval. This is illustrated by
FIGS. 10C-10E. FIG. 10C represents a marked interval in an original
stored diphone waveform, indicating a region of remaining samples to be
added to the next interval, which is illustrated by FIG. 10D.

Since this summation must be performed on linear rather than companded
signals, the Waveform Generator converts the samples to be added to
linear form, adds the converted samples, and converts the sums back to
companded form. Standard tables for making such conversions are well
known. The result of this process is shown in FIG. 10E, which
illustrates the raw signal [remainder of sentence illegible in the
source]. In contrast to processes that [words illegible] and discard
the truncated samples, Applicant's summation of overlapping adjacent
intervals provides additional fidelity in the speech output signal,
especially in those cases in which significant energy occurs at the end
of an interval.

The above-described adjustments (especially for interval length) can
result in annoying harshness in the speech output signal even when the
interval marks have been set so that the points for insertion and
deletion of signal fall in areas of lowest amplitude. Thus, on a sample
by sample basis, the Waveform Generator preferably converts the
companded signal (after amplitude and pitch adjustment) to a linear
format and applies a digital filter to smooth out the discontinuities
introduced. The digital filter [definition of the input and output
signals illegible in the source] the smoothed signal, given by the
equation:

[equation not legible in the source]

The smoothed signal is converted back to companded (mu-law) form or
left as a linear signal, depending on the output format of the D/A
converter that converts the digital speech output to analog form for
reproduction.

Speed Mode

For high rate speech, the Waveform Generator shortens unvoiced
intervals during copying by removing 25% of the samples from the
boundaries between unvoiced sounds and by removing samples from the
silence areas. For the voiced intervals, the above-described
interval-by-interval process referencing the Duration Array is used,
but the Generator steps through every other voiced interval before
applying the above logic, thereby effectively shortening voiced output
segments by a factor of two compared to the output in normal mode for
the same duration factor.

The above-described techniques for modification of duration and pitch
are applications to the current data formats of well known time-domain
manipulation techniques such as those described in D. Malah,
"Time-Domain Algorithms for Harmonic Bandwidth Reduction and Time
Scaling of Speech Signals", IEEE Trans. on Acoustics, Speech and Signal
Processing, vol. ASSP-27, pp. 121-133 (April, 1979) and F. Lee, "Time
Compression and Expansion of Speech by the Sampling Method", J. Audio
Engg Soc., vol. 20, pp. 738-742 (November, 1972).

Sound Generation

The digital speech output waveform produced by the Waveform Generator
1120 may be immediately passed to a D/A converter or may be stored for
later playback.

Modifications for Other Speakers/Voices

Producing synthetic speech that sounds like a different speaker simply
requires digitizing speech from that speaker containing all diphones
and extracting from the speech those segments that correspond to the
diphones.

Modifications for Other Languages

Other languages naturally require their own diphone waveforms, which
would be obtainable from a native speaker of the language, drawn up
according to the phonetic structure of that language. This portion of
the TTS system's processing is substantially independent of the
language; only the Diphone Waveforms 1130 need adaptation.
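
As an illustration of the prosody-array arithmetic described earlier under Diphone-Based Prosody, the following Python sketch normalizes SD, computes the duration factor from the relation read as DF[j] = Dur + SD[j] + 4 - LS[j] with clamping to the range zero to sixteen, and initializes pitch from a declaration baseline and the pmod table. All function names are hypothetical. The pmod value for a total stress of 3 is not legible in the source table and is assumed here to be 1, and the equations for TS and AM are omitted because they are not legible either.

```python
def normalize_sd(sd):
    """SD[j] = SD[j] - minSD, applied only when minSD is greater than zero."""
    min_sd = min(sd)
    return [v - min_sd for v in sd] if min_sd > 0 else list(sd)


def duration_factors(dur, sd, ls):
    """DF[j] = Dur + SD[j] + 4 - LS[j], clamped to the range 0..16."""
    return [max(0, min(16, dur + s + 4 - l)) for s, l in zip(sd, ls)]


def pmod(pvpsl, ts):
    """Pitch multiplier for a normalized total stress value (1 = most stressed)."""
    if ts <= 1:
        return 1 + 2 * pvpsl
    if ts == 2:
        return 1 + pvpsl
    if ts == 3:
        return 1.0  # ASSUMPTION: this table cell is illegible in the source
    return 1 - pvpsl


def declaration_baseline(reference, n, drop=0.09):
    """Straight line from `reference` down to about 9% lower over n diphones."""
    if n == 1:
        return [reference]
    last = reference * (1 - drop)
    return [reference + (last - reference) * j / (n - 1) for j in range(n)]


def initial_pitch(reference, pvpsl, ts):
    """Common starting value for P1[j] and P2[j]: baseline * pmod(pvpsl, TS[j])."""
    baseline = declaration_baseline(reference, len(ts))
    return [b * pmod(pvpsl, t) for b, t in zip(baseline, ts)]
```

The syllabic and word effects described in the text would then be applied on top of these starting values; they are left out to keep the sketch short.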

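
The amplitude adjustment described under The Interval Copying Process can likewise be sketched in Python. The sketch assumes that each companded sample is available as a signed integer code (sign plus magnitude) and that magnitudes are limited to a hypothetical maximum code of 127; neither detail is stated in the text, and the function name is illustrative only.

```python
def adjust_amplitude(sample, am, max_code=127):
    """Additively adjust one signed companded sample by the value AM.

    ASSUMPTIONS: `sample` is a signed companded code value and
    `max_code` is the largest magnitude the format can hold.
    """
    if sample == 0:
        return 0
    # A positive AM raises the magnitude and a negative AM lowers it,
    # whatever the sign of the sample.
    adjusted = sample + am if sample > 0 else sample - am
    # If the sign would flip, the result is set to zero instead.
    if (adjusted > 0) != (sample > 0):
        return 0
    # Keep the result inside the assumed code range.
    return max(-max_code, min(max_code, adjusted))
```

Because the codes are roughly logarithmic, adding a constant to the magnitude acts approximately like multiplying the linear signal by a gain, which is why the text notes that linear storage would call for multiplication instead.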
TABLE I

Typical Word Dictionary Entries

record    NCOM #1RE2kxr# VB #2Ri1kord#
invalid   NCOM #1In2v&3LYd# ADJ #2In1vA2LYd#
1,212     NUMS #?#
ain't     (BEM NEG) (BER NEG) (BEZ NEG) #*#
etc       AVRB #1Et#1$Et3tx2R&#
jump      NCOM VB #1j&mp#



TABLE II

Phonetic Transcription Symbol Key

Vowels
  a  hOt        A  bAt        e  bAIt
  E  bEt        i  bEEt       I  bIt
  o  bOAt       O  bOUght     u  bOOt
  U  pUsh       &  bUt        )  mamA
  Y  carrOt     x  beepER     X  titLE

Diphthongs
  J  wIre       W  grOUnd     V  bOY

Glides
  y  Yes        w  Wet

Liquids
  R  Road       r  caR
  L  Leap       l  faLL

Nasals
  m  Man        n  wheN       N  haNG

Stops
  b  Bet        p  Put        d  heaD
  D  miDDle     t  Take       T  meTal
  g  Get        k  Kit

Affricates
  j  Jet        c  CHat

Fricatives
  f  Fall       v  Very       Q  baTH       q  baTHe
  s  Save       z  Zoo        S  SHoot      Z  aZure
  h  Help

Consonant Clusters
  %  SPend      S  SKate      @  STand

Other
  #  silence    ?  number     *  can't say word    "  empty phone

TABLE III

Contents of the Duration Array

duration
factor
  16      +1  +1  +1  +1  +1  +1  +1  +1
  15      +1  +1  +1  +1   0  +1  +1  +1
  14      +1  +1   0  +1  +1  +1   0  +1
  13      +1  +1   0  +1  +1   0  +1   0
  12      +1   0  +1   0  +1   0  +1   0
  11      +1   0   0  +1   0   0  +1   0
  10      +1   0   0   0  +1   0   0   0
   9       0   0   0  +1   0   0   0   0
   8       0   0   0   0   0   0   0   0
   7       0   0   0  -1   0   0   0   0
   6      -1   0   0   0  -1   0   0   0
   5      -1   0   0  -1   0   0  -1   0
   4      -1   0  -1   0  -1   0  -1   0
   3      -1  -1   0  -1  -1   0  -1   0
   2      -1  -1   0  -1  -1  -1   0  -1
   1      -1  -1  -1  -1   0  -1  -1  -1
   0      -1  -1  -1  -1  -1  -1  -1  -1
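
One way to read Table III is sketched below in Python: for a given duration factor, the Waveform Generator steps across the corresponding row once per processed interval, copying the interval again on +1 and skipping the next stored interval on -1. The function name, the wrap-around at the end of a row, and the use of list items to stand in for pitch intervals are assumptions for illustration, and only a few rows of the table are included.

```python
# A few rows copied from Table III (duration factor -> drop/duplicate pattern).
DURATION_ARRAY = {
    16: [+1, +1, +1, +1, +1, +1, +1, +1],
    12: [+1, 0, +1, 0, +1, 0, +1, 0],
    8: [0, 0, 0, 0, 0, 0, 0, 0],
    7: [0, 0, 0, -1, 0, 0, 0, 0],
    6: [-1, 0, 0, 0, -1, 0, 0, 0],
}


def apply_duration_pattern(intervals, df):
    """Copy stored intervals to the output, duplicating or dropping per Table III.

    ASSUMPTION: the row is reused cyclically when a diphone has more
    intervals than the row has columns.
    """
    pattern = DURATION_ARRAY[df]
    output = []
    i = 0      # index of the current stored interval
    step = 0   # column of the table row, advanced once per processed interval
    while i < len(intervals):
        output.append(intervals[i])        # every processed interval is copied once
        action = pattern[step % len(pattern)]
        if action > 0:
            output.append(intervals[i])    # duplicate: copy the interval again
        elif action < 0:
            i += 1                         # drop: skip the next stored interval
        i += 1
        step += 1
    return output
```

With eight stored intervals, a duration factor of 8 leaves the diphone unchanged, 16 doubles every interval, and 6 keeps six of the eight, so under this reading the duration factor is the number of output intervals produced per eight stored ones.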



TABLE IV 



List of Diphones 



##  #&  #)  #@  #1  #D  #J  #K  #L  #1  #X  #o  #R  #r  #S  #T
#u  #v  #w  #Y  #a  #b  #c  #d  #e  #f  #g  #h  m   #k  #m  #n
#N  #P  #q  #t  #v  #w  #y  #z  #A  #E  #Q  #z  m   #0  #s  #u
#x

&#  &&  &D  &N  &Q  &S  &T  &L  &R  &r  &x  &h  &w  &b  &c  &d
&f  &g  &i  &k  &1  &X  &n  &p  &q  &s  &x  &v  &z

)#  ))  )L  )X  )Q  )R  )r  )x  )S  )T  )b  )c  )d  )g  )h  M
)k  )m  >   )P  )t  )v  )w  )D  )N  )1  )q  >

@A

AA  AN  AS  AT  Ab  Ad  Af  Ak  Al  Am  An  Ap  Aq  As  At  Av
Az  A#  AD  AL  AX  AQ  AZ  Ac  Ag  Aj  Ar  AR  Ax

DS  DY  Di  Dx  D#  D)  DE  DI  DL  DR  Dr  DX  Da  De  Dl  Dn
Do

E#  ED  EE  EQ  ES  ET  Ed  Ef  Ej  Ek  El  Em  En  Ep  Eq  Er
Ex  Es  Et  Ev  Ez  EL  EX  ER  EZ  Eb  Ec  Eg

I#  ID  II  IN  IQ  IS  IT  IZ  IL  IX  IR  Ib  Ic  Id  If  Ig
Ij  Ik  Il  Ih  Im  In  Ip  Iq  Ir  Ix  Is  It  Iv  Iw  Iz

J#  J)  J&  JA  JD  JL  JQ  JR  JT  JX  JY  Ja  Jb  Jd  Jf  Jg
Ji  Jk  Jl  Jm  Jn  Jo  Jp  Jq  Jr  Js  Jt  Jv  Jw  Jx  Jy  Jz

L&  L#  L)  LA  LE  U   u   LL  LR  Lr  LX  LI  LO  LU  LV  LW
LY  La  Le  Lf  Li  Lb  Lc  Ld  Lg  Lj  Lk  Ln  Lm  Lp  Lq  Ls
Lt  Lv  Lw  Ly  Lz  LD  LQ  LS  LT  Lo  Lu  Lx

N#  N&  Nq  NQ  Ng  Ni  Nk  Nt  Nz  N)  N&  NL  Nl  NX  NS  Nb
Nd  Nf  Nh  Nn  Np  Nx  Nr  NR  Ny

O#  ON  OO  OS  OT  Oc  Od  Of  Og  Oi  Ok  Ol  On  Op  Os  Ot
Oz  OY  OI  Ox  OD  OL  OX  OQ  Ob  Oj  Or  OR

Q#  Q&  Q)  QA  QE  QO  QR  Qr  QV  QW  QY  Qi  Qo  Qs  Qx  Q)
Qi  QJ  QL  Ql  QX  Qa  Qb  Qd  Qe  0/  Qk  Qm  Qn  Qp  Qt  Qu
Qw

R#  RD  Rd  Rf  Rg  Rj  Rk  Rm  Rn  Rs  Rt  Rz  RQ  RS  Rb  Rc
Rp  Rq  Rv  Rw  RT  RZ  R&  R)  RA  RE  RI  RJ  RO  RX  RL  Rl
RV  RW  RY  Ra  Re  Ri  Ro  Ru  RU  Rx  RR  Rr

S#  SA  SI  SU  SX  SV  SW  SY  Se  Si  So  St  Su  Sx  S&  S)
SE  SJ  SL  Sl  SR  Sr  Sa  Sb  Sf  Sh  Sk  Sm  Sn  Sv

T#  T)  T&  TX  TY  Ti  TI  TL  Tl  Te  Tm  Tn  To  Ts  Tx  TR
Tr

U#  UT  UR  Ur  Ux  UU  Ud  Uk  Ul  UL  UX  Um  Ut  UD  US  Uc
Ug  Up  Us

V#  V)  V&  VL  VX  VY  VI  Vb  Vd  Vf  Vi  Vk  Vl  Vm  Vn  Vs
Vt  Vx  VR  Vr  Vy  Vz

W#  WD  WE  WI  WQ  WX  WY  Wb  Wc  Wd  Wi  Wl  WL  Wm  Wn  Wr
WR  Ws  Wt  Wx  Wz

X#  X&  X)  Xf  XL  XA  XE  X)  XU  XV  XW  XD  XR  Xr  XO  XQ
XS  Xa  Xb  Xc  Xe  Xj  Xk  Xo  Xp  Xq  Xv  Xw  Xu  Xy  XX  XI
Xd  Xt  XI  XJ  XT  XY  Xg  Xi  Xm  Xn  Xs  Xx  Xz

Y#  YL  YX  YR  YS  YT  YY  Yb  Yd






Yf 


Yg 


Yj 


Yk 


VI 


Vm 


v« 
in 


Yp 


Ve 

XS 


Vr 
It 


v»» 

IV 


IX 


IZ 


VT% 
I Lf 


*X 


V7 

J£, 


xc 


Vli 
ID 


xr 


I w 


TV 


TI 
£A 




7Ji 
£ftr 


7% 
**) 


Zj6L. 








Zx 


1* 
£S 


7Xt 
£tt\ 


of) 




ab 


EC 


ad 


ag 


a J 


al 






ap 




aR 


EX 


OS 


at 


afr 


«T 
airf 


aX 


aN 


ay 


aS 


aT 


tZ 


af 


ah 


ak 


aq 


av 


ax 


b# 


b& 


bA 


bE 


bl 




KT 


KI 
DJ 


KO 
DU 


OK 


or 


KIT 
DU 


uv 
OA 


bV 


bW 


bY 


b) 


bd 


bh 


om 


v_ 

on 


Kb 

OS 


kn 

DU 


Dw 


Da 


K* 
DC 


Ki 

Dl 


kt 

"J 


DO 


kt 


ov 


DX 


oy 


OZ 




c& 


c\ 
W 


cV 


cE 


cL 


cl 


cO 


cX 


CD 


ca 


CI 


CO 


cm 


cq 


cn 


ct 




cl 


rT 

CJ 


cW 


cY 






rJ 
CI 


CO 


cu 






cy 


Off 


d& 


dS 


dX 


dY 


db 


df 


dh 


dk 


ai 


^ m 


Art 

an 


ap 


He 
OS 


At 


OV 


dw 


A\ 
**) 


dA 


AV 
oc 


dl 


dJ 


dL 


dO 


dR 


dr 


dV 


dW 


da 


de 


di 


do 


all 


j_ 
ax 


OZ 


eff 


eu 


eu 


eiN 


eS 


eT 


eb 


cc 


ed 


ec 


ei 


«J 


*Ir 
CJC 


Cl 


cm 


en 


ep 


CT 


cs 


ct 


ev 


ez 




est 


eQ 
Ik 


CK. 


ew 


CA 


ei 


Cl 


C£t 


ca 


CI 


eg 


eh 


CO 


cq 


ex 


w 


IA 


<T? 

Ub 


rw 
11 


LI 


rr 
11^ 


fl 


fD 
IK 


ir 


ft T 
IU 


fV 


IW 


IA 


rv 
II 


A 
*) 


IU 


*Q 


rr 
11 


fV 

XD 


»g 


fK 

in 


u 


fl- 


in 


*q 


IW 


fa 

ia 


ie 


fn 
10 


IS 


ft 
n 


IU 


Fr 
IX 


fv 

*y 


g* 


gA 


gc 


gl 




gl* 




gO 


gR 


gr 


gU 


gv 


«v 

gx 


gY


ga 


ge 


si 


gn 


go 


gw 


gx 


gy 


gz 


g# 


g) 


gw 


gz 


go 


gd 


gi 


gm 


SP 


qu 


uBL 


ti A 

nA 


kP 

nc 


hi 


hJ 


hO 


hV 


nw 


b) 


KTT 


LV 

ni 


Kn 

na 


K» 

ne 


Kl 


no 


Kn 

no 


nx 


kr 

nr 


hR 


hy 


i# 


9 


i& 


iA 


iD 


iL 


iN 


|Q 


it* 
ll 


IW 


li 


IT? 


;t 

ii 


U 


io 


iR 


iS 


iX 


iZ 


ia 


ib 


ih 


io 


ic 


id 


ie 


it 
it 




ii 


u 


ik 


U 


un 


in 


<P 


>q 


IT 


is 


it 


IV 


IW 


IX 


*y 


iz 


j# 


j& 


jA 


jE 


jJ 


jv 


rv 
J7 


ja 


jo 


j« 


jx 


jr 


JR 


J) 


j* 


JT 

JL 


Jl 


JO 


;y 
J x 


irl 

Jd 


J e 


ji 


jm 


jn 




Kff 




XJ 


Ir A 
KA 


kl 


kJ 


kL 


ki 


KU 


KK 


icr 


Kd 


KU 


KV 


KW 


t,v 
KA 


l/V 
KI 


Ira 

jca 


tr- 


ke 


ki 


kn 


ko 


KS 


Kt 


Kw 


KX 


*y 


KC 




Kl 


Ka 


KI 


Kg 


kk 


km 


kp 


kn 


i^t 
iff 


1& 


f\ 

1) 


1 A 

IA 


I'D 

IK 


ir 


ll 


i; 
11 


1- 
ua 


Kj 


In 

!P 


iq 


Is 


lv 


lw 


Ix 


iy 


lz 


ID 


1J 


1L 


U 


IX 


10 


1Q 


IS 


IY 


11 


lb 


lc 


Id 


le 


Ig 




lk 


In 


mA 


mL 


ml 


mR 


mr 


mf 


mh


mk


mm


mw 


m# 


m& 


m) 


mE 


ml 


mJ 


mO 


mQ 


mV 


mW 


mX 


mY 


ma 


mb 


md 


me 


mi 


mn 


mo 


mp 


mq 


ms 


mt 


mu 


mx


my 


nu 


n& 


n) 


nA 


nD 


nE 


ni 


nJ 


aL 


ni 


nQ 


nS 


nV 


nW 


nX 


nY 


na 


nc 


nd 


ne 


nf 


ng 


ni 


nj 


nk 


nm


no 


ns 


nu 


nv 


nx 


ny 


nz 


n# 


nO 


nR 


nr 


nT 


oZ 


nb 


nh 


nn 


np 


nq 


nt 


nw 


o) 


o& 


oA 


oD 


oE 


ol 


oO 


oR 


oT 


oZ 


oa 


ob 


oe 


oh 


oj 


ow 


o# 


oL 


oX 


oQ 


oS 


oY 


oc 


od 


of 


og 


oi 


ok 


ol 


om 


on 


00 


op 


oq 


or 


OS 


ot 


ov 


ox 


OZ 


P# 


P& 


P) 


P A 


Pi 


PJ 


pL 


pO 


pR 


pr 


pU 


pv 


pW 


pX 


PY 


pa 


PC 


Pi 


Pi 


pm 


PO 


pq 


P$ 


Pt 


PX 


py 


pE 


pQ 


PS 


pT 


pc 


pd 


Pf 


ph 


P* 


on 


pu 


pw 


q) 


qL 


qX 


qi 


qY 


qd 


qx 


q# 


q& 


qA 


qW 


qE 


qi 


qc 


qi 


qo 


qx 


qR 


qr 


r# 


rD 


rE 


rL 


rl 


rT 


rX 


rY 


rf 


id 


rf 


rg 


ri 




rk 


rm 


rn 


rs 


it 


rz 


0 


r& 


rA 


rJ 


iO 


rV 


rW 


ru 


rU 


rZ 


rQ 


rS 


ra 


rb 


re 


re 


ro 


*P 


") 


rv 


rw 


rx 


IT 


rR 


s) 


sQ 


sR 


sr 


sS 


sT 


sb 


sd 


sf 


sg 


sh 


sj 


sn 


sp 


sq 


ss 


SW 


sy 


s# 


s& 


sA 


sE 


si 


sJ 


sL 


si 


sO 


sV 


SW 


sX 


sY 


sa 


sc 


se 


si 


sk 


sm 


so 


St 


su 


sx 


tA 


lc 


ti 


u 




*i 
u 


•o 


tp 

LK 


tx 


*TT 
IU 


tv 


tW 


tY 


La 


te 


ti 


tm 


tn 


to 


tq 


ts 


tu 


tw 


tx 


t# 


t& 


t) 


tX 


td 


tf 


tp 


uA 


ul 


uR 


uS 


uT 


uW 


ub 


uf 


ug 


ui 


uj 


uo 


uq 


ur 


us 


ut 


uw 


UX 


u# 


uD 


uE 


uL 


uQ 


uX 


uY 


uZ 


uc 


ud 


uc 


ok 


ul 


um


un 


up 


uu 


uv 


UZ 


v# 


vA 


vE 


vl 


vJ 


vO 


vR 


vr 


vV 


vW 


vX 


vY 


va 


vd 


ve 


vi 


vm 


vn 


vo 


vq 


vx 


vy 


vz 


v& 


v) 


vL 


vl 


vb 


w 


vw 


wA 


wW 


wX 


wl 


wL 


wu 


w& 


w) 


wE 


wl 


wJ 


wO 


wU 


wV 


wY 


wa 


we 


wi 


wo 


ws 


wx 


wr 


wR 


wz 


x# 


xL 


xQ 


xX 


xY 


xc 


xd 


xf 


xg 


xh 


xi 


xk 


xl 


xm 


xn 


xp 


xq 


XS 


xt 


XV 


XZ 


x> 


x& 


xA 


xD 


xE 


xi 


xJ 


xO 


xR 


XX 


xS 


xT 


xZ 


xa 


xb 


xe 


xj 


xo 


xw 


XX 


xV 


xW 


xu 


xU 


y) 


yA 


yi 


yo 


yv 


y& 


yE 


yx 


yi 


yL 


ya 


yi 


yo 


yu 


ys 


yx 


yR 


yr 


z# 


z& 


zA 


zE 


zl 


zJ 


zW 


zY 


za 


zb 


zd 


ze 


zi 


zn 


zq 


ZX 




zL 


zl 


zO 


zR 


zr 


zV 


zX 


zm 


zo 


zp 


zt 


zu 


ZW 
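The two-character names in Table IV suggest that each diphone is named by the two adjacent phoneme symbols it spans, with "#" standing for silence at the edge of an utterance. The following is a minimal sketch under that assumption; the function name to_diphones and the sample phoneme string are hypothetical and do not come from the patent.

```python
def to_diphones(phonemes, boundary="#"):
    """Split a phoneme string into overlapping two-symbol diphone names.

    Assumes one character per phoneme and pads both ends with the
    boundary symbol, so "kat" is treated as "#kat#".
    """
    padded = boundary + phonemes.strip(boundary) + boundary
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


if __name__ == "__main__":
    # Hypothetical phoneme string, used only to show the windowing.
    print(to_diphones("kat"))  # ['#k', 'ka', 'at', 't#']
```

Each name produced this way would then be looked up in the stored diphone inventory.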















TABLE V

Diphone Substitutions

#) -> #&
m -> #s
#D -> #d
#J -> #a
)q -> &q
)z -> &z
AT -> At
AD -> Ad
#l -> #L
#x -> #L
#O -> #a
#r -> #R
AX -> AL
Ac -> At
Aj -> Ad
AR -> Ar
#T -> #d
#u -> #&
#V -> #o
#W -> #A
Ax -> At
DY -> dY
Di -> di
D# -> d#
#b -> #d
#c -> #d
#e -> #E
#g -> #d
D) -> d&
DE -> dE
DI -> dI
DL -> dL
#j -> #d
#k -> #d
#N -> #n
#p -> #d
DR -> dR
Dr -> dR
DX -> dX
Da -> da
#t -> #d
#z -> #z
&D -> &d
&T -> &t
De -> dE
Dn -> dn
Do -> do
E# -> &#
&L -> )L
&r -> &R
&x -> &R
&h -> )h
ED -> Ed
ET -> Et
Ej -> Ed
Ex -> Er
&w -> )w
&c -> &t
&j -> &d
&X -> &J
EX -> EL
Ec -> Et
I# -> &#
ID -> It
)) -> &&
)X -> )L
)Q -> &Q
)R -> &R
IT -> It
IX -> IL
Ic -> It
Id -> It
)r -> &R
)x -> &R
)S -> &s
)T -> &t
Ij -> It
Ih -> Yh
Ix -> Ir
Iw -> Yw
)b -> &b
)c -> &t
)d -> &d
)f -> &f
JD -> Jd
JT -> Jt
Jj -> Jd
L# -> l#
)g -> &g
)j -> &d
)k -> &k
)m -> &m
L) -> L&
LJ -> La
LL -> lL
LR -> lR
)n -> &n
)p -> &p
)s -> &s
)t -> &t
Lr -> lR
LX -> lL
Ll -> LL
LO -> La
)v -> &v
)D -> &d
)N -> &N
)l -> &l
LV -> Lo
LW -> LA
Le -> LE
Lf -> lf






Lb -> lb
Lc -> lc
Lj -> ld
Lk -> lk
Lp -> lp
Lq -> lq
Lv -> lv
Lw -> lw
LD -> lD
LD -> ld
LT -> Xt
N& -> N)
Nr -> Nx
NR -> Nx
OS -> aS
OT -> at
Of -> af
Og -> ag
On -> an
Op -> ap
Oz -> az
OI -> OY
OX -> aL
OQ -> aQ
Or -> ar
OR -> ar


OV 

>< * 


-> Op 


QW 


-> QA 


OJ 


_ Oa 


Ql - 


-> QL 


RD 


— > rD 


RD 


— > rd 


KB 


— P- *g 


Ri - 


-*> rd 


1U1 


— > rn 


Re . 


-> IS 


pn 


— > ry 


P3 




KC ■ 


— > It 


Pn 
Kp 


-> *P 


Rw 


— > rw 


DT 

KI 


— > rt 


t> f 

KJ • 


— ^ Km 


pn 


Pa 


KV 


Pn 


K.VV 


-s. PA 
— > KA 


P*- 


-> KX 




— > OO 


5) - 


Oft. 

- > «e 


QT 


- > &a 




— > w 


T) - 


-> t& 


Tn 


-> tn 


Ts - 


-> ts 




— > 


irr 


— > Ut 


Ux 


— > ux 


ULr 


>_ TT1 


Uc -> Ut
V& -> V)
Vr -> Vx
WD -> Wd
WR -> Wr
X& -> L&
XA -> LA
XE -> LE
XV -> Lo
XW -> LA
Xr -> lR
XO -> lO
Xa -> La
Xb -> lb
Xj -> ld
Xk -> lk
Xq -> lq
Xv -> lv
Xy -> ly


AA 


— > AX/ 


Y# 


->&# 


VY 




Yx 


->Yr 


vn 


— > IQ 


Z& 


->2) 


*<r - 


-> Z.X 


ac -> at
aj -> ad
aX -> aL
aT -> at
bO -> ba
br -> bR
b) -> b&


Din ■ 


— > Djf 


bt-> b# 


cV ■ 




cJ -> ca 


cW 




dS -> DS 


dl - 




dJ - 


-> da 


ACS 


— > oa 


dW 


->dA 


ac - 


HP 
- OX1 


ec - 


-> et 


ej — 




0 - 


■> fa 


n — 




fW 


->fA 




eg. 


fe -> fE
gJ -> ga
gr -> gR
gV -> go
g) -> g&


© 




hO 


->ha 


hv - 




he - 


-> faE 


hr - 




rr- 


-> it 


iW - 


— ^ in. 


ic - 


> it 




> iE 


jv - 


-> jo 


• 

jr — 


J A 


jl- 


>jL 


iO - 


_ *s. ia 


U - 


->ka 


kl — 




kV 


->ko 


kW 


— ^ 


IcT 


->kt 


Irrr - 
Kg - 




lr - 


>IR 


ID - 




IX- 


->IL 


U - 


>1Y 


ml -> mL
mr -> mR
mO -> ma
mV -> mo
n) -> n&
nD -> nd
nV -> no
nW -> nA
nJ -> ad
nO -> na
oD -> od
oO -> oa
oj -> od
oX -> oL
pJ -> pa
pO -> pa
pW -> pA
pe -> pE
pT -> pt
pc -> pt
qX -> gL
ql -> gL
qR -> QR
qr -> QR
rT -> rt
rI -> rY



Ld -> ld
Ln -> ln
Ls -> ls
Ly -> ly
LQ -> lQ
Nl -> NL
OP -> a#
Oc -> at
Ok -> ak
Os -> as
OD -> ad
Ob -> ab
QO -> Qa
Qx -> qx
Qe -> QE
Rd -> rd
Rk -> rk
Rt -> rt
Rb -> rb
Rq -> rq
RZ -> xZ
RL -> RX
Re -> RE
SW -> SA
Sl -> SL
T& -> t&
TR -> tR
UR -> uR
UX -> Ul
VI -> VY
Wc -> Wt
X) -> L&
X) -> La
XD -> Xd
XQ -> lQ
Xc -> lc
Xo -> Lo
Xw -> lw
Xl -> XL
YT -> Yt
Yc -> Yt
ZR -> Zx
aR -> ar
W -> ba
bV -> bo
be -> bE
cl -> cL
ce -> cE
dw -> d#
dr -> dR
eD -> ed
e& -> e)
fr -> fR
fO -> fa
gl -> gL
ge -> gE
gP -> g#
hW -> hA
hR -> hx
U -> ia
ij -> id
jR -> jx
je -> jE
kO -> ka
kc -> kt
l) -> l&
U -> l&
lg -> l#
m) -> m&

Lg -> l#
Lm -> lm
Lt -> Xt
Lz -> lz
LS -> lS
NX -> NL
ON -> aN
Od -> ad
Ol -> al
Ot -> at
OL -> aL
Oj -> ad
Qr -> QR
Q) -> Q&
R# -> r#
Rf -> rf
Rm -> rm
Rz -> rz
Rc -> rc
Rv -> rv
R) -> R&
Rl -> RX
RR -> Rx
Se -> SE
Sr -> SR
Tl -> TL
Tr -> tR
Ur -> ur
UD -> Ud
VR -> Vx
WL -> WI
Xf -> lf
XU -> LU
XR -> lR
XS -> lS
Xe -> LE
Xp -> lp
Xu -> Lu
XT -> Xt
Yj -> Yd
ZI -> ZY
aD -> ad
ax -> ar
W -> bL
bW -> bA
bj -> b#
cO -> ca
cr -> kR
d) -> d&
dV -> do
eT -> et
eI -> eY
fV -> fo
fT -> ft
gO -> ga
gj -> gd
U -> ha
h) -> h&
iD -> id
iO -> ia
jJ -> ja
j) -> j&
k) -> k&
kr -> kR
ke -> kE
lA -> LA
n -> ll
lj -> U
mJ -> i
mW -> mA
me -> mE
nl -> na
nl -> nL
nc -> nt
ne -> nE
nr -> nR
nT -> nt
oT -> ot
oe -> oE
oc -> ot
p) -> p&
pr -> pR
pl -> pL
pd -> p#
qW -> qA
rD -> rd
rj -> rd
pV -> po
pt -> p#
q) -> q&
qe -> qE
rl -> rL
rJ -> ra





rO -> ra
rV -> ro
rW -> rA
ru -> Ru
rU -> RU
rZ -> xZ
rc -> rt
re -> rE
rr -> rx
rR -> rx
s) -> s&
sr -> sR
sT -> st
sj -> sd
sJ -> sa
sl -> sL
sO -> sa
sV -> so
sW -> sA
sc -> st
se -> sE
sk -> sg
tJ -> ta
tl -> tL
tO -> ta
tr -> tR
tV -> to
tW -> tA
te -> tE
v -> tat
uT -> ut
uW -> uA
uj -> ud
uD -> ud
uc -> ut
ue -> uE
vJ -> va
vO -> va
vr -> vR
vV -> vo
vW -> vA
ve -> vE
v) -> v&
vl -> vL
wW -> wA
wl -> wX
wL -> wX
w) -> w&
wJ -> wa
wO -> wa
wV -> wo
we -> wE
wr -> wx
wR -> wx
xc -> x#
xg -> x#
x& -> x)
xD -> xd
xJ -> xa
xO -> xa
xr -> xR
xT -> xt
xe -> xE
xj -> xd
xV -> RV
xW -> RW
xu -> Ru
xU -> RU
y) -> y&
yA -> yE
yO -> ya
yl -> yx
yL -> yX
yR -> yx
yr -> yx
zJ -> za
zW -> zA
ze -> zE
z) -> z&
zl -> zL
zO -> za
zr -> zR
zV -> zo
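One way to read Table V is as a lookup that maps a requested diphone name onto the name of a similar diphone before waveform retrieval. The following is a minimal sketch under that assumption; the function names parse_table and substitute are hypothetical, and only a handful of entries copied from the table are included.

```python
# Illustrative subset of Table V; a full system would load every entry.
SUBSTITUTIONS = {"bO": "ba", "nD": "nd", "mV": "mo", "vr": "vR", "zV": "zo"}


def parse_table(text):
    """Build a substitution map from lines of the form 'X -> Y'."""
    table = {}
    for line in text.splitlines():
        if "->" not in line:
            continue  # skip blank lines and headings
        src, dst = (part.strip() for part in line.split("->", 1))
        if src and dst:
            table[src] = dst
    return table


def substitute(diphone, table=SUBSTITUTIONS):
    """Return the replacement name, or the name itself if none is listed."""
    return table.get(diphone, diphone)


if __name__ == "__main__":
    print([substitute(d) for d in ["zV", "Vt", "t#"]])  # ['zo', 'Vt', 't#']
```

Whether the substitution is applied to every listed name or only when a waveform is missing is not stated in this part of the document; the sketch applies it unconditionally.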







The foregoing description of the invention is intended to be
in all senses illustrative, not restrictive. Modifications and
refinements of the embodiments described will become apparent
to those of ordinary skill in the art to which the present
invention pertains, and those modifications and refinements
that fall within the spirit and scope of the invention, as
defined by the appended claims, are intended to be included
therein.
What is claimed is:

1. A system for synthesizing a speech signal from strings of
words, comprising:
means for entering into the system strings of characters
comprising words;
a first memory, wherein predetermined syntax tags are stored
in association with entered words and phonetic transcriptions
are stored in association with the syntax tags;
parsing means, in communication with the entering means and
the first memory, for grouping syntax tags of entered words
into phrases according to a first set of predetermined
grammatical rules relating the syntax tags to one another and
for verifying the conformance of sequences of the phrases to a
second set of predetermined grammatical rules relating the
phrases to one another, wherein the sequences of the phrases
correspond to the entered words;
first means, in communication with the parsing means, for
retrieving from the first memory the phonetic transcriptions
associated with the syntax tags grouped into phrases
conforming to the second set of rules, for translating
predetermined strings of entered characters into words, and
for generating strings of phonetic transcriptions and prosody
markers corresponding to respective strings of entered and
translated words;
second means, in communication with the first means, for
adding markers for rhythm and stress to the strings of
phonetic transcriptions and prosody markers and for converting
the strings of phonetic transcriptions and prosody markers
into arrays having prosody information on a
diphone-by-diphone basis;
a second memory, wherein predetermined diphone waveforms are
stored; and






third means, in communication with the second means and the
second memory, for retrieving diphone waveforms corresponding
to the entered and translated words from the second memory,
for adjusting the retrieved diphone waveforms based on the
prosody information in the arrays, and for concatenating the
adjusted diphone waveforms to form the speech signal.

2. The synthesizing system of claim 1, wherein the first means
interprets punctuation characters in the entered character
strings as requiring various amounts of pausing, deduces
differences between entered character strings having
declarative, exclamatory, and interrogative punctuation
characters, and places the deduced differences in the strings
of phonetic transcriptions and prosody markers.

3. The synthesizing system of claim 2, wherein the first means
generates and places markers for starting and ending
predetermined types of clauses in a synth log.

4. The synthesizing system of claim 1, wherein the second
means adds extra pauses after highly stressed entered words,
adjusts duration before and stress following predetermined
punctuation characters in the entered character strings, and
adjusts rhythm by adding marks for more or less duration onto
phonetic transcriptions corresponding to selected syllables of
the entered words based on a stress pattern of the selected
syllables.

5. The synthesizing system of claim 1, wherein the third means
adjusts the retrieved diphone waveforms for coarticulation.

6. The synthesizing system of claim 1, wherein the parsing
means verifies the conformance of a plurality of parallel
sequences of phrases and phrase combinations to the second set
of grammatical rules, each of the plurality of parallel
sequences comprising a respective one of the sequences
possible for the entered words.

7. In a computer, a method for synthesizing a speech signal by
processing natural language sentences, each sentence having at
least one word, comprising the steps of:

entering a sentence;

storing the entered sentence;

finding syntax tags associated with the words of the stored
entered sentence in a word dictionary;

finding in a phrase table non-terminals associated with the
syntax tags associated with the entered words as each word is
entered;

tracking, in parallel as the words are entered, a plurality of
possible sequences of the found non-terminals;

verifying the conformance of sequences of the found
non-terminals to rules associated with predetermined sequences
of non-terminals;

retrieving, from the word dictionary, phonetic transcriptions
associated with the syntax tags of the entered words of one of
the sequences conforming to the rules;

generating a string of phonetic transcriptions and prosody
markers corresponding to the entered words of said one
sequence;

adding markers for rhythm and stress to the string of phonetic
transcriptions and prosody markers and converting said string
into arrays having prosody information on a
diphone-by-diphone basis;

retrieving, from a memory wherein predetermined diphone
waveforms are stored, diphone waveforms corresponding to said
string and the entered words of said one sequence;

adjusting the retrieved diphone waveforms based on the prosody
information in the arrays; and

concatenating the adjusted diphone waveforms to form the
speech signal.

8. The synthesizing method of claim 7, wherein the generating
step comprises the steps of interpreting punctuation
characters in the entered sentence as requiring corresponding
amounts of pausing, deducing differences between declarative,
exclamatory, and interrogative sentences, and placing the
deduced differences in the string of phonetic transcriptions
and prosody markers.

9. The synthesizing method of claim 8, wherein the generating
step includes placing markers for starting and ending
predetermined types of clauses in a synth log.

10. The synthesizing method of claim 7, wherein the adding
step comprises the steps of adding extra pauses after highly
stressed entered words, adjusting duration before and stress
following predetermined punctuation characters in the entered
sentence, and adjusting rhythm by adding marks for more or
less duration onto phonetic transcriptions corresponding to
selected syllables of the entered words based on a stress
pattern of the selected syllables.

11. The synthesizing method of claim 7, wherein the adjusting
step comprises adjusting the retrieved diphone waveforms for
coarticulation.
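The last three steps of the method (retrieving stored diphone waveforms, adjusting them from per-diphone prosody information, and concatenating the results) can be illustrated with a small sketch. Everything here is an assumption made for illustration: the names stretch and synthesize, the representation of a waveform as a list of floats, and the use of a (duration factor, gain) pair as the prosody entry. The duration change uses plain linear-interpolation resampling, which also shifts pitch; it is a stand-in, not the adjustment technique of the patent.

```python
def stretch(samples, factor):
    """Resample by linear interpolation to about len(samples) * factor samples."""
    if not samples:
        return []
    n_out = max(1, round(len(samples) * factor))
    if n_out == 1 or len(samples) == 1:
        return [samples[0]] * n_out
    step = (len(samples) - 1) / (n_out - 1)
    out = []
    for i in range(n_out):
        pos = i * step
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out


def synthesize(diphones, store, prosody):
    """Retrieve, adjust, and concatenate one waveform per diphone name.

    store maps a diphone name to its samples; prosody holds one
    (duration_factor, gain) pair per diphone.
    """
    signal = []
    for name, (duration, gain) in zip(diphones, prosody):
        wave = stretch(store[name], duration)
        signal.extend(gain * s for s in wave)
    return signal


if __name__ == "__main__":
    # Toy two-sample "waveforms", used only to exercise the functions.
    store = {"#a": [0.0, 1.0], "a#": [1.0, 0.0]}
    out = synthesize(["#a", "a#"], store, [(1.0, 1.0), (2.0, 0.5)])
    print(len(out))  # 6
```

A practical system would smooth the joins between adjacent diphones and change duration without changing pitch; both are omitted here to keep the sketch short.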






